This is an attempt to compare the performance of various object reuse strategies for JMonkeyEngine (and, indirectly, Ardor3D). See this JME forum topic for background info. Also, it’s worth bearing in mind that the main driver for this is to reduce GC pauses, not to improve throughput (this is mentioned in the forum topic).
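The pooled implementation under test isn't reproduced in this article, but the general shape of the thread-local pooling idea can be sketched as follows (all names here are hypothetical, not the actual JME code). Each thread owns its own pool, so `fetch()` and `release()` need no locking:

```java
import java.util.ArrayList;

// Minimal sketch of a thread-local object pool (hypothetical names).
// Each thread gets a private pool, so no synchronisation is needed.
final class Quaternion {
    float x, y, z, w = 1f; // identity rotation by default
}

final class ObjectPool {
    // One pool per thread; note the ThreadLocal look-up on every get().
    private static final ThreadLocal<ObjectPool> POOLS =
            new ThreadLocal<ObjectPool>() {
                @Override protected ObjectPool initialValue() {
                    return new ObjectPool();
                }
            };

    private final ArrayList<Quaternion> free = new ArrayList<Quaternion>();

    static ObjectPool get() {
        return POOLS.get();
    }

    Quaternion fetch() {
        int n = free.size();
        return n == 0 ? new Quaternion() : free.remove(n - 1);
    }

    void release(Quaternion q) {
        free.add(q);
    }
}
```

The point of the design is that released objects go back onto a per-thread free list instead of becoming garbage, which is what keeps the GC quiet.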
These are the timings from my initial object pooling implementation using 1 thread (all timings throughout this article are in milliseconds):
An interesting (although unrelated to the topic at hand) thing to note here is that it's pretty easy to spot when HotSpot kicked in, although the jump in timings between runs 4 and 5 bears investigating further, and the slow run (run 7) is certainly a worry. Run 7 seems to be consistently slow, and turning on -verbose:gc doesn't reveal anything here. These are the timings from my object pooling implementation using 2 threads; note that each thread here does a fixed chunk of work, rather than a fixed amount of work being shared between the threads (i.e. 2x the number of threads = 2x the amount of work):
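The per-thread set-up described above can be sketched like this (the figures and names are my own illustration, not the original benchmark): each thread performs a fixed chunk of work, so doubling the thread count doubles the total work done.

```java
// Hedged sketch of the measurement loop: a fixed chunk of work per thread,
// so 2x threads = 2x total work. Returns elapsed wall-clock milliseconds.
final class Bench {
    static final int OPS_PER_THREAD = 1000000; // hypothetical figure

    static long run(int threads, final Runnable op) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    // Each thread does its own fixed chunk of work.
                    for (int j = 0; j < OPS_PER_THREAD; j++) {
                        op.run();
                    }
                }
            });
        }
        long start = System.currentTimeMillis();
        for (Thread t : workers) t.start();
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return System.currentTimeMillis() - start;
    }
}
```

Under this set-up a perfectly scalable implementation on enough cores would show flat timings as threads are added, which is why the numbers below are worth comparing against the thread counts.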
So the timings are higher, although not twice as high as might be expected; this is probably because my laptop has a 2-core CPU. Now let's look at the timings for the same test with 10 threads:
Yep, about 5x the time of the 2-thread runs. Now that we have an idea how the thread-local, pooling-based implementation performs, let's look at the existing implementation to give us a baseline to compare against. These are the timings from the existing implementation using 1 thread:
Well, that's quite a bit faster for a single thread, although not an order-of-magnitude difference. Note that this run didn't use any locking, as it was running from a single thread; a more realistic implementation would need the locks in place in case future code tried to use multiple threads. Let's have a look at 2 threads now:
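For context, the baseline "shared scratch object" approach looks roughly like this (a hypothetical sketch, not the actual JME code): a single static temporary instance reused by every caller, which is cheap for one thread but serialises all threads once locking is added.

```java
// Sketch of the shared-scratch baseline (hypothetical names): one static
// temporary Quaternion reused by all callers, guarded by a lock.
final class SharedScratch {
    static final class Quaternion { float x, y, z, w; }

    private static final Quaternion TEMP = new Quaternion();

    // Example operation borrowing the shared scratch instance; the lock
    // makes it thread-safe at the cost of contention on TEMP.
    static float lengthSquared(float x, float y, float z, float w) {
        synchronized (TEMP) {
            TEMP.x = x; TEMP.y = y; TEMP.z = z; TEMP.w = w;
            return TEMP.x * TEMP.x + TEMP.y * TEMP.y
                 + TEMP.z * TEMP.z + TEMP.w * TEMP.w;
        }
    }
}
```

Because every caller queues on the same lock, adding threads adds work without adding any parallelism, which is consistent with the scaling behaviour discussed next.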
Oops! This is about what we'd expect to see: doubling the amount of work doubles the amount of time needed (as the quaternion class is now a shared resource). And 10 threads:
OK, I got bored of waiting after 2 runs! But it's clear to see that it's much slower. Finally, based on suggestions from vear and also looking at the code used in the Javolution library (a set of real-time Java classes), I decided to try a version that reduced the number of thread-local look-ups needed. This comes at the cost of not providing a single reusable ObjectPool class, but as that class is pretty trivial anyway, it's no great loss to leave it out of the framework.
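The Javolution-inspired variant can be sketched like this (again, hypothetical names rather than the actual code): the free list lives inside the pooled class itself, and the caller grabs the thread's list once and reuses the reference, so the ThreadLocal look-up is paid once per batch of work instead of once per fetch.

```java
import java.util.ArrayList;

// Sketch of the reduced-look-up variant (hypothetical names): the pool is
// baked into the pooled class, and callers hold the thread's free list.
final class Quat {
    float x, y, z, w = 1f;

    private static final ThreadLocal<ArrayList<Quat>> FREE =
            new ThreadLocal<ArrayList<Quat>>() {
                @Override protected ArrayList<Quat> initialValue() {
                    return new ArrayList<Quat>();
                }
            };

    // One ThreadLocal look-up; the caller keeps the returned reference.
    static ArrayList<Quat> localPool() {
        return FREE.get();
    }

    static Quat fetch(ArrayList<Quat> pool) {
        int n = pool.size();
        return n == 0 ? new Quat() : pool.remove(n - 1);
    }

    static void release(ArrayList<Quat> pool, Quat q) {
        pool.add(q);
    }
}
```

The trade-off mentioned above is visible here: this fetch/release boilerplate has to be repeated in every pooled class rather than living in one generic ObjectPool.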
Wow! It's pretty clear that the pooled approach is much faster and that the cost of performing the thread-local look-up is fairly significant. Interestingly, I also tried this using raw arrays instead of ArrayLists, and it was much slower; I can only surmise that because ArrayList is so heavily used throughout Java, it gets insanely optimised by HotSpot. As a side note, here's my Java environment:
And the code used for these tests is available here. I also tried this using Java 1.5 with both the server and client VMs: the 1.5 server VM is noticeably slower, and the 1.5 client VM is frankly a dog; it was 5-6 times slower than the 1.6 timings given here.