An Improved Hardware Acceleration Scheme for Java Method Calls

Tero Säntti*†, Joonas Tyystjärvi*‡, and Juha Plosila†*
* Dept. of Information Technology, University of Turku, Finland
† Academy of Finland, Research Council for Natural Sciences and Engineering
‡ Turku Centre for Computer Science, Finland
{teansa|jttyys|juplos}@utu.fi
Abstract— This paper presents a significantly improved strategy for accelerating the method calls in the REALJava co-processor. The hardware assisted virtual machine architecture is described shortly to provide context for the method call acceleration. The strategy is implemented in an FPGA prototype. It allows measurements of real life performance increase, and validates the whole co-processor concept. The system is intended to be used in embedded environments, with limited CPU performance and memory available to the virtual machine. The co-processor is designed in a highly modular fashion, especially separating the communication from the actual core. This modularity of the design makes the co-processor more reusable and allows system level scalability. This work is a part of a project focusing on design of a hardware accelerated multicore Java Virtual Machine for embedded systems.

I. INTRODUCTION

Java is very popular and portable, as it is a write-once run-anywhere language. This enables coders to develop portable software for any platform. Java code is first compiled into bytecode, which is then run on a Java Virtual Machine (hereafter JVM). The JVM acts as an interpreter from bytecode to native microcode, or more recently uses just in time compilation (JIT) to effect the same result a bit faster at the cost of memory. This software only approach is quite inefficient in terms of power consumption and execution time. These problems arise from the fact that executing one Java instruction requires several native instructions. Another source of inefficiency is the memory usage. The software based JVMs have to keep internal registers of the virtual machine in the main memory of the host system. When the execution of the bytecode is performed on a hardware co-processor this is avoided and the overall amount of memory accesses is reduced. Because the methods in Java are generally quite small in terms of storage requirements for the code that is running and the data being processed, it is possible to keep all the required items in a relatively small local memory inside the co-processor. Actually just 128 kB of internal memory is enough to store all of the methods used in an embedded application. This includes the Java benchmark for embedded systems found in [15] and the embedded version of the CaffeineMark [14]. Since this local memory is not mirrored to the main memory, which usually resides in a physically external memory chip, it is energy efficient.

This work is a part of the VirtuES project, which focuses on fully utilizing the potential of embedded multicore systems using a virtual machine approach.

Overview of the paper. We proceed as follows. In Section 2 we shortly describe the structure of our hardware assisted JVM, and show how the proposed co-processor fits into the Java specifications. Section 3 describes the methods in Java and sheds light on the differences in the ways methods can be invoked. In Section 4 the strategy for the accelerator is presented with details of the hardware unit, focusing on the differences to the previous solution. In Section 5 some benchmark results are given and analyzed. Finally in Section 6 we draw some conclusions and describe the future efforts related to the REALJava virtual machine.

II. JAVA VIRTUAL MACHINE

In the Java Virtual Machine Specification [4], Second Edition, the structure and behavior of all JVMs is specified at a quite abstract level. This specification can be met using several techniques. The usual solutions are software only, including some performance enhancing features, such as JIT (Just In Time Compilation). We have chosen to use a HW/SW combination [7] in order to maximize the hardware usage and minimize the power consumption.

Fig. 1. Internal architecture of the REALJava JVM

The HW portion (shown on the right side of Figure 1) handles most of the actual Java bytecode execution, whereas the SW portion (the left side of Figure 1) takes care of memory management, class loading and native method calling. This partitioning gives the possibility to use the co-processor with any type of host CPU(s) and operating systems, as all of the platform dependent properties are implemented in software and most of the platform independent bytecode execution is done in hardware.

978-1-4244-8971-8/10/$26.00 © 2010 IEEE
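The interpretation overhead mentioned in the Introduction — several native instructions per Java bytecode, with the VM registers living in main memory — can be illustrated with a minimal software dispatch loop. This is a hypothetical fragment for illustration only (the opcode values are the real JVM ones, but the loop is not part of the REALJava software):

```java
// Minimal sketch of a software interpreter loop: every bytecode costs
// a fetch, a decode and several native operations on in-memory VM state.
public class TinyInterpreter {
    static final int ICONST_1 = 0x04, IADD = 0x60, IRETURN = 0xAC;

    // Executes a tiny method; pc, stack and sp model the VM registers
    // that a software-only JVM must keep in the host's main memory.
    static int execute(int[] code) {
        int[] stack = new int[16];
        int sp = 0, pc = 0;
        while (true) {
            switch (code[pc++]) {                 // fetch and decode
                case ICONST_1: stack[sp++] = 1; break;
                case IADD:     sp--; stack[sp - 1] += stack[sp]; break;
                case IRETURN:  return stack[--sp];
                default: throw new IllegalStateException("bad opcode");
            }
        }
    }

    public static void main(String[] args) {
        // iconst_1; iconst_1; iadd; ireturn
        System.out.println(execute(new int[] {ICONST_1, ICONST_1, IADD, IRETURN}));
    }
}
```

Even this toy loop spends most of its native instructions on dispatch and stack bookkeeping rather than on the addition itself, which is exactly the cost the co-processor removes.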
Because Java supports multithreading at the language level, it makes sense to integrate several co-processors as a SoC. This gives an ideal solution for complex systems running several Java threads and possibly some native code at the same time. This approach brings forth true multithreading and thus improves performance. Also large systems possibly contain several software subsystems, such as internet protocols, user interface controllers and so on, which can easily be coded in Java, and since they all are executed in parallel the user experience is enhanced.

The system architecture can be chosen to be a network of any kind or bus based, as suitable for other components in the system. The structure of the underlying communication medium is rather irrelevant, as long as the lower level provides two properties: 1) the datagrams must arrive at their destination in the same order that they were sent, and 2) the datagrams arriving from two different sources to the same destination must be identifiable. The first property can be achieved with a lower level network protocol, like ATM adaptation layer (AAL) for the internet, or by the physical structure of a bus. The second property seems quite natural, and should be present in all solutions. The communication scheme for the co-processor is discussed in more detail in [5].

The architecture for the co-processor is presented in [6] and the whole system including hardware and software portions can be found in [7] and [10]. The basic design used for the FPGA implementations in this paper is the same, with only minor fine tuning on some of the units. There are 5 control registers in the execution unit. These are the program counter PC, stack top pointer ST, code offset CO, local variable pointer LV and local variable info LO. The PC holds the address of the current instruction relative to the CO. The ST and LV registers are internal addresses to the local memory. The CO contains the starting address of the current method in the method area of the co-processor. The last register holds two values, the number of parameters Nparams and the number of local variables Nlocals for the current method. After applying the new method invocation structure, the LO register is removed from the design.

The Java virtual machine also provides a rich standard library. In most current research virtual machines the GNU Classpath [16] is used. The GNU Classpath is a free implementation of the standard library, and it is constantly being developed. Currently it covers more than 95% of the methods. The missing methods are quite rare, so in most cases the GNU Classpath is sufficient. As per recommendations for Java programming, the classpath has been built from very small methods, which are invoked often during the execution of a Java program. Also many of the methods in the classpath call even smaller sub-methods. This emphasizes the importance of having a fast method invocation architecture in a virtual machine. The method size statistics for selected benchmarks are shown in Table I, and they clearly support the claim of small methods being invoked often. More statistics about Java methods can be found in [11]. An independent study can also be found in [1].

                      Salesman      Sort  Raytrace   Caffeine
Stack frame size          8.98      4.77      7.01       5.55
Method length            38.86      8.67      9.26      14.83
Total invocations       991228  18412516   1957996   27779867

TABLE I
Statistics from method invocations in selected benchmarks. The first two rows are averages and they are measured in 32-bit words.

III. METHOD CALLS IN JAVA

The Java virtual machine specification [4] defines the types of methods that can be invoked in Java. Because Java is an object-oriented language, methods are usually invoked on objects, with the actual method implementation chosen based on the runtime type of the object. Methods that are not invoked on objects are called static methods. Besides static methods, the most important categories of methods are defined in the access flags bit field of the method definition. The most important access flags during bytecode execution are acc_synchronized and acc_native. Acc_synchronized means that when the method is invoked, the monitor (the primary synchronization construct in Java) of the object that the method is invoked on is entered, and the monitor is exited on return from the method. Acc_native means that the method is implemented in a native language of the platform. Native methods can be bound to actual native functions at runtime.

Methods are invoked using one of the four bytecode instructions invokevirtual, invokespecial, invokeinterface and invokestatic. All of these instructions perform a method lookup based on a 16-bit index to the constant pool of the currently executing class. Invokevirtual and invokeinterface then perform a further lookup based on the runtime type of the object that the method is being invoked on, while invokestatic and invokespecial invoke the method found immediately. As symbolic method resolution is very slow, it is common to modify the constant pool and the instruction data itself either during class loading or after the execution of a call instruction. A common technique for accelerating invokevirtual instructions is the use of virtual tables [2], which contain a pointer to each non-interface method that a class implements, with a fixed index for each method identifier. Performing a virtual table lookup is much faster than finding the method by symbolic lookup in the class of the object. A somewhat related technique is so-called "inline caches" [3], which enable just-in-time compilers to quickly inline the most common implementations of virtual methods into their call sites.

As Java is an object-oriented language, invokevirtual is intended to be the primary method invocation instruction. The other instructions are used for special cases: invokespecial is used to invoke object constructors, private methods (which can be hidden by subclasses) and to explicitly invoke a certain implementation of a virtual method, invokeinterface is only used to invoke methods through an interface pointer and invokestatic does not operate on an object instance. It is important to notice that as long as a class has no subclasses, invokevirtual can be executed like invokespecial. The same applies for interfaces with only one implementation.
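The virtual table technique of [2] can be modelled roughly as follows. The Klass type, the slot numbering and the example operations are illustrative assumptions for this sketch, not the actual data structures used by REALJava or any particular JVM:

```java
import java.util.function.IntUnaryOperator;

// Rough model of virtual dispatch: each class carries a table indexed
// by a fixed per-method slot, so a virtual call is one bounded array
// lookup instead of a symbolic search through the class hierarchy.
public class VtableModel {
    static final int SLOT_AREA = 0;          // fixed slot for method "area(x)"

    static class Klass {
        final IntUnaryOperator[] vtable;      // one entry per virtual method
        Klass(IntUnaryOperator[] vtable) { this.vtable = vtable; }
    }

    // invokevirtual: pick the implementation from the runtime type's table
    static int invokeVirtual(Klass runtimeType, int slot, int arg) {
        return runtimeType.vtable[slot].applyAsInt(arg);
    }

    public static void main(String[] args) {
        Klass square = new Klass(new IntUnaryOperator[] { x -> x * x });
        Klass twice  = new Klass(new IntUnaryOperator[] { x -> 2 * x });
        System.out.println(invokeVirtual(square, SLOT_AREA, 5)); // 25
        System.out.println(invokeVirtual(twice,  SLOT_AREA, 5)); // 10
    }
}
```

When a class has no subclasses (or an interface has a single implementation), the table holds only one candidate, which is why such calls can be devirtualized and executed like invokespecial.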
If the overload status of virtual functions is stored in the method definition and updated when new classes are loaded, three types of method invocation instructions can be executed without access to heap data or native functions: non-native invokestatic, invokespecial and invokevirtual with a single implementation. These instructions can be implemented using only a constant pool lookup and, in the case of invokevirtual, a test of the overload status of the method.

The new architecture presented in this paper makes use of an observation about Java programs we made recently. We noticed that the stack of a given Java method is always empty when a return instruction is executed. This feature is not mentioned in the Java virtual machine specification [4], but it seems that not one of the Java compilers we tried generates code where the stack would not be empty. Assuming the stack to always be empty makes the return much simpler, but we were hesitant due to the fact that a class with a non-empty stack during return would still be a legal construct. When a bytecode modification engine [12] was added to the bytecode verification phase, it was noticed that it could be used for emptying the stack if required. The bytecode verification keeps count of the stack at all points of the bytecode, so adding just the required amount of pop instructions before an offending return instruction would fix the situation. So far this has never been observed, but the check is kept in the verification process for the sake of security and in order to be compliant with all legal Java code.

Returning from a method happens using one of 6 instructions: return, ireturn, freturn, areturn, lreturn or dreturn. These differ only by the data pushed to the stack of the calling method. The first one pushes nothing, while the next three push one word and the last two push two words. Even though the 32-bit versions have several bytecodes reserved, they are implemented using only one mechanism. The difference between these instructions is only used during class loading for verification purposes. The 64-bit instructions are handled similarly. Since the actual returning process is exactly the same for all of the instructions, we only consider the return instruction, and state that the data to be pushed to the calling method stack is stored into temporary registers during the return process.

IV. INVOCATION AND RETURN PROCEDURES

First, let us have a look at what happens in the stack of the virtual machine during a method invocation. Figure 2 shows how the new stack frame is created. Before the actual invocation, the calling method pushes the required parameters to the top of its stack. In the Figure these are shown as Parameters, and the number of them is denoted with the symbol Nparams. The symbol Nlocals tells how many local variables the new method uses. Note that the parameters become a part of the local variable array for the new method. The symbol X is just a shorthand for Nlocals − Nparams.

Fig. 2. The effects of the invocation process on the stack.

Now let us review the mechanism presented in [9]. In the following formulas the CallInfo vector comes from the invoker module shown in Figures 3 and 4. In the original architecture the CallInfo was 56 bits long. The SWCTRL symbol is used for control bits that tell both the hardware and the software that some special actions are required during the return phase of the method. An example would be a return to a native method. This situation cannot be handled in the hardware, since the control is returned to the native method executed by the CPU. Please notice that pushing the return info to the stack after the new register values have been calculated updates the ST accordingly.

Formulas for calculating the new register values:

  PC ⇐ 0
  ST ⇐ ST_OLD − CallInfo(15..0) + CallInfo(31..16)
  CO ⇐ CallInfo(55..32)
  LV ⇐ ST_OLD − CallInfo(15..0)
  LO ⇐ CallInfo(31..0)

Data pushed to the stack frame (Return Info):

  SWCTRL & PC_OLD
  ST_OLD − CallInfo(15..0)
  CO_OLD
  LV_OLD
  LO_OLD

Then we can move on to the new method invocation procedure. Here are the modified invocation formulas. These use the new architecture, so the CallInfo is only 16 bits long, and it is presented in Figure 5.

Formulas for calculating the new register values:

  PC ⇐ 0
  ST ⇐ ST_OLD + X
  CO ⇐ CallInfo(15..0)
  LV ⇐ ST_OLD − Nparams
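As a sanity check, the new register update rules above can be simulated in software. This is a behavioural sketch only: the numeric values are made up, and the subsequent push of the two return-info words, which further updates the ST, is omitted:

```java
// Behavioural model of the new invocation register update.
// CallInfo is now only the 16-bit code offset; Nparams and
// X = Nlocals - Nparams come from the instruction memory.
public class InvokeModel {
    int pc, st, co, lv;      // the four remaining control registers

    void invoke(int callInfo16, int nParams, int x) {
        lv = st - nParams;   // LV <= ST_OLD - Nparams: parameters become locals
        st = st + x;         // ST <= ST_OLD + X: reserve the remaining locals
        co = callInfo16;     // CO <= CallInfo(15..0): callee's code offset
        pc = 0;              // PC <= 0: start at the callee's first instruction
    }

    public static void main(String[] args) {
        InvokeModel m = new InvokeModel();
        m.st = 10;               // caller has pushed 2 parameters, stack top at 10
        m.invoke(0x1234, 2, 3);  // Nlocals = 5, so X = 3
        System.out.println(m.pc + " " + m.st + " " + m.co + " " + m.lv);
    }
}
```

Note that LV must be computed from the old ST before ST itself is updated, mirroring the ST_OLD terms in the formulas.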
Data pushed to the stack frame (Return Info):

  SWCTRL & CO_OLD
  LV_OLD & PC_OLD
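The two return-info words pack four values into 16-bit fields. A small model of the packing and the matching return-time unpacking follows; the field placement (Data0 holding LV_OLD and PC_OLD, Data1 holding SWCTRL and CO_OLD) is inferred from the return formulas given later in this section, and the concrete values are made up:

```java
// Pack the two return-info words; on return, PC, LV and CO are
// recovered from 16-bit fields (SWCTRL occupies the upper half of
// the word holding the old code offset).
public class ReturnInfoPacking {
    // Word layout: upper 16 bits | lower 16 bits
    static int pack(int hi16, int lo16) {
        return (hi16 << 16) | (lo16 & 0xFFFF);
    }

    public static void main(String[] args) {
        int data1 = pack(0x0001, 0x4321);   // SWCTRL & CO_OLD
        int data0 = pack(0x0042, 0x0010);   // LV_OLD & PC_OLD
        int pc = data0 & 0xFFFF;            // PC <= Data0(15..0)
        int lv = data0 >>> 16;              // LV <= Data0(31..16)
        int co = data1 & 0xFFFF;            // CO <= Data1(15..0)
        System.out.println(pc + " " + lv + " " + co);
    }
}
```

Halving the frame from five words to these two is where the 3-word-per-invocation memory saving discussed later comes from.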
Naturally the procedures for performing a return instruction are also simplified in the new architecture. The process of returning from a method can be seen as going in the opposite direction in Figure 2. The original values for most of the registers associated with the stack frame on the left side are returned. Only the ST is modified, to reflect the fact that the parameters consumed by the invoked method have been removed from the stack. The old system performed the following actions during return instructions.

Formulas for calculating the new register values:

  PC ⇐ Data0
  ST ⇐ Data1
  CO ⇐ Data2
  LV ⇐ Data3
  LO ⇐ Data4

where the DataN symbols are retrieved from the stack frame using a separate indexing scheme, which offsets the index by Nlocals and then uses the normal local variable loading mechanism inside the co-processor. Altogether this sequence requires 5 data items to be retrieved from the data memory.

The modified architecture gets the same results in a much simpler fashion, using the following formulas:

  PC ⇐ Data0(15..0)
  ST ⇐ LV_OLD
  CO ⇐ Data1(15..0)
  LV ⇐ Data0(31..16)

Again the DataN symbols are retrieved from the data memory, but now they can be retrieved using the normal pop mechanism. This simplifies the hardware, since now the unit handling local variables is not required to handle the additional offsetting. The amount of data to be retrieved is also decreased from 5 to just 2 words. This naturally decreases the amount of memory required for return information in each stack frame from 5 to 2 words. Using the pop mechanism is possible since we are now assuming that the stack of the current method is empty when performing the return. Notice also that now the new value for the ST is not calculated at all, but it is simply the value of the LV in the current method.

Fig. 3. The invoker connected to the ALU and the registers.

The invoker will speed up the invocation of methods that are already loaded to the local memory of the co-processor. When an invocation command is encountered in the ALU, it sends the constant pool index of the method to the invoker module and sets query high. At this time the invoker performs a look-up in the content addressable memory (CAM) using the method id and the code offset as the key, as shown in Figure 4. In the old architecture the code offset was 24 bits long. In the new architecture the software performs a process called constant pool merging, during which the constants defined by a given class are added to a global constant pool instead of a separate pool for each class. This saves memory by merging constants already defined by other classes and also speeds up the constant pool look-up, since finding the constant pool for the current class is not required. This technique also reduces the size of the CAM key, because the code offset of the current method is no longer needed. Only the method id, now in the new unified constant pool, is required. The new structure can be seen in Figure 5. As an additional bonus, the method cache utilization is improved. This happens when invoking one method, let us call it A, from several different classes. The scenario results in only one cache line, while the old architecture would have required a separate line for each method invoking A.

Fig. 4. The original CAM structure.

Fig. 5. The modified CAM structure.

After the key has been found in the CAM, the match address is sent to a normal RAM, which stores the information needed to perform the method call.
Processor                 REALJava (old)  REALJava (new)  Kaffe on PPC  Units  Gain %
Engine speed                         100             100           300   MHz     N/A
Simple call                      3125000         6666666         59453   1/s   113.3
Instance call                     713042         1160730         19460   1/s    62.8
Synchronized call                 366473          564810         15567   1/s    54.1
Final call                        671343         1097950         18090   1/s    63.5
Class call                        671255         1248860         18847   1/s    86.0
Synchronized class call           260401          350181             —   1/s    34.5
Salesman                           11438            9027        111824   ms     26.7
Sort                               40569           31386        856684   ms     29.3
Raytrace                            7205            5494        169646   ms     31.1
EmbeddedCaffeineMark                 156             231            10           48.1
EmbeddedCaffeineMark ND              184             279            11           51.6

TABLE II
Results from various benchmarks.

This RAM was 56 bits wide, and consisted of 24 bits for the code offset of the new method, 16 bits for the number of local variables, Nlocals, and finally 16 bits for the number of parameters, Nparams, taken by the new method. Our improved scheme stores only the code offset for the method to be invoked. The length is now limited to 16 bits. Instead of storing the Nlocals and Nparams to the invoker cache, a different approach is chosen. Namely, the Nparams and X are stored to the instruction memory, just before the actual Java code for the method. The value of X is calculated during class loading to minimize the computation required during runtime. This strategy minimizes the size of the method cache unit. The increased memory requirement for the instruction side is only one word per method, and since the stack frames have been reduced by 3 words, the net effect is positive even if each method is invoked only once. Naturally, if there are subsequent invocations for a method already loaded to the instruction memory, the net saving resulting from the new architecture is 3 words per invocation. The code offset found from the RAM is also sent to the instruction memory controller, which in turn returns the values for Nparams and X to the ALU for use in the invocation. If a key is found, then the get_regs signal is set high to indicate a valid match. This triggers the ALU to capture the CallInfo, the Nparams and the X, and to calculate new register values using the rules presented earlier.

In case a match is not found in the CAM, a trap is produced. To indicate this condition to the ALU the do_trap signal is set high. Upon receiving this signal the ALU sets the trap signal high to the communication module, and finally the host CPU performs the needed actions to start execution of the new method. At the same time the invoker module saves the key to the CAM. When the execution resumes after the trap, the invoker module captures the required register values and saves them to the RAM. Now the invoker is ready to speed up execution in case the same method is called again. When the invoker module saves a new key to the CAM it uses a circular oldest algorithm to choose which entry to replace. This scheme provides a reasonably close approximation of the least recently used algorithm with very low complexity.

The invoker module can also clear its contents. This is required for situations where a virtual method has been cached to the module, and a new overloading virtual method needs to be loaded. Overloading of methods causes them to fall out of the cache because selecting the implementation for a specific call requires access to heap data. The host CPU is better suited for this kind of task, so it is assigned there.

The module was integrated into our REALJava co-processor prototype as 8 places deep. This depth was chosen as the statistics presented in [9] show that size to provide the highest impact on performance with the least resources. The prototype is based on a Xilinx ML410 demonstration board. This board provides all the services one might expect of a computer, such as a network controller, a hard drive controller, a PCI bus and so on. The FPGA chip is a Virtex4FX, which includes two hardcore PowerPC CPUs. The co-processor is connected to the CPU via the Processor Local Bus (PLB 3.4). The system runs the co-processor at 100 MHz, while the PowerPC CPU runs at 300 MHz. The CPU runs Linux 2.4.20 as the operating system providing services (network, filesystem, etc.) to the virtual machine. For more details on the prototype, please see [8] and [10]. The system has also been implemented on a Virtex5 based board. This configuration used the newer PLB bus (4.6) as the communication channel and a MicroBlaze as the CPU. The CPU provides considerably less arithmetic performance, since it is a softcore processor implemented using FPGA resources and runs at 100 MHz. The larger FPGA chip allowed us to include eight co-processor cores in the system. Unfortunately we have only implemented the new invocation architecture on that platform, so we do not present detailed results for this platform in this paper.

V. RESULTS

The results in Table II show that the invoker module has a significant impact on the execution times of the benchmarks. In the table REALJava (old) stands for a configuration with the original invoker, REALJava (new) stands for a configuration with the improved invoker and Kaffe on PPC is the Kaffe Virtual Machine running on the same PowerPC processor. REALJava, even though running at a lower clock speed, clearly outperforms Kaffe in all of the benchmarks. The Gain is the percentage of improvement achieved with the improved invoker module.

The first set of benchmarks is a collection of method call tests. They measure mostly the method invocation performance, and do not include (significant amounts of) arithmetics.
The first one simply calls an empty method and then returns without any processing inside the invoked method. The next 5 are taken from the Java Grande Suite [17] to show the performance gains for various method types. These benchmarks contain a few Java instructions inside the invoked methods, so some time is spent performing actual arithmetics. The arithmetic speed of the JPU is exactly the same for both versions, which explains the lower gain percentages in these tests when compared to the simple call test.

The next set of benchmarks is a collection of tests that have been written to evaluate real life performance. The benchmark programs do not contain any special optimizations for our hardware. Short descriptions of the benchmarks follow. Salesman solves the traveling salesman problem using a naive try-all-combinations method, Sort tests array handling performance by creating arrays of random numbers and then sorting them, and Raytrace renders a 3D sphere above a plane. As the benchmarks emphasize different aspects of the system, together they should give a rather good estimation of different practical applications that might be found on an embedded Java system. The results show 26 to 31 percent improvement in the execution speed with the new invocation module.

Several websites and research papers dedicated to Java execution have used the CaffeineMark as a performance measurement. The CaffeineMark is also available as an embedded version, which omits the graphical tests from the desktop version. The test scores are calibrated so that a score of 100 would equal the performance of a desktop computer with a 133 MHz Intel Pentium class processor. The individual tests cover a broad spectrum of applications. Since the REALJava is intended for embedded systems, we also calculated the scores without the floating point sub-test. These scores are reported in Table II on the line marked with ND (No Double arithmetics). These results are marked with italics because they were measured using a new version of the software partition of the REALJava virtual machine, which contains some modifications besides the invocation architecture. Because of this, the results do not give an accurate view of the effect of the new invocation architecture. For reference we give the scores for the Virtex5 based system, which are 142 and 198 for the embedded CaffeineMark with and without double arithmetics. These results show a decrease from the PowerPC based system, which is due to the significantly slower CPU. Naturally this test was run using only one core on that system, although eight of them could be used in parallel. More results can be found at our results site [13]; the invocation architecture was changed and fine tuned between versions 2.09 and 3.01 of the REALJava.

VI. CONCLUSIONS AND FUTURE WORK

An improved strategy for accelerating method calls in Java using a hardware module was presented. The module was implemented on a Xilinx FPGA to provide several benchmarks and show significant improvement in both specialized and more general benchmarks. In addition to the improved performance, the new architecture reduces the size of the stack frames, thus reducing the overall memory requirements for the co-processor. Also the hardware is simplified, since the LO register and the offsetting mechanism for local variables were removed.

We plan to continue refining the REALJava virtual machine. Currently we are mostly focusing on improvements to the software partition, but the hardware is also evolving at the same time. On the hardware side the most interesting new topic we are studying is making the co-processor core into a reconfigurable module and providing system level support for dynamically adding and removing co-processors as needed. This kind of a system could better utilize the resources on a given FPGA by providing several special purpose cores to be used based on the user application.

ACKNOWLEDGMENT

The authors would like to thank the Academy of Finland for their financial support for this work through the VirtuES project.

REFERENCES

[1] S. Byrne, C. Daly, D. Gregg and J. Waldron, "Dynamic Analysis of the Java Virtual Machine Method Invocation Architecture", In Proc. WSEAS 2002, Cancun, Mexico, May 2002.
[2] O.-J. Dahl and B. Myhrhaug, "Simula Implementation Guide", Publication S 47, Norwegian Computing Center, March 1973.
[3] J. Lee, B. Yang, S. Kim, K. Ebcioğlu, E. Altman, S. Lee, Y. C. Chung, H. Lee, J. H. Lee, and S. Moon, "Reducing virtual call overheads in a Java VM just-in-time compiler", In SIGARCH Comput. Archit. News 28, 1, pp. 21-33, March 2000.
[4] T. Lindholm and F. Yellin, "The Java Virtual Machine Specification", Second Edition, Addison-Wesley, 1997.
[5] T. Säntti and J. Plosila, "Communication Scheme for an Advanced Java Co-Processor", In Proc. IEEE Norchip 2004, Oslo, Norway, November 2004.
[6] T. Säntti and J. Plosila, "Architecture for an Advanced Java Co-Processor", In Proc. International Symposium on Signals, Circuits and Systems 2005, Iasi, Romania, July 2005.
[7] T. Säntti, J. Tyystjärvi and J. Plosila, "Java Co-Processor for Embedded Systems", In Processor Design: System-on-Chip Computing for ASICs and FPGAs, J. Nurmi, Ed. Kluwer Academic Publishers / Springer Publishers, 2007, ch. 13, pp. 287-308, ISBN-10: 1402055293, ISBN-13: 978-1402055294.
[8] T. Säntti, J. Tyystjärvi and J. Plosila, "FPGA Prototype of the REALJava Co-Processor", In Proc. 2007 International Symposium on System-on-Chip, Tampere, Finland, November 2007.
[9] T. Säntti, J. Tyystjärvi and J. Plosila, "A Novel Hardware Acceleration Scheme for Java Method Calls", In Proc. ISCAS 2008, Seattle, Washington, USA, May 2008.
[10] T. Säntti, "A Co-Processor Approach for Efficient Java Execution in Embedded Systems", Ph.D. thesis, (https://oa.doria.fi/handle/10024/42248), University of Turku, November 2008.
[11] J. Tyystjärvi, "A Virtual Machine for Embedded Systems with a Co-Processor", M.Sc. Thesis, University of Turku, 2007.
[12] J. Tyystjärvi, T. Säntti and J. Plosila, "Instruction Set Enhancements for High-Performance Multicore Execution on the REALJava Platform", In Proc. NORCHIP 2008, Tallinn, Estonia, November 2008.
[13] "BenchMark Results", consulted 18 August 2010, http://vco.ett.utu.fi/~teansa/REALResults.
[14] "CaffeineMark 3.0", consulted 18 August 2010, http://www.benchmarkhq.ru/cm30/.
[15] "Embedded Java Book Index", consulted 18 August 2010, http://www.practicalembeddedjava.com/.
[16] "GNU Classpath", consulted 18 August 2010, http://www.gnu.org/software/classpath/.
[17] "JavaG Benchmarking", consulted 18 August 2010, http://www2.epcc.ed.ac.uk/computing/research activities/java grande/