Trials, Tribulations and Garbage Collection

Note: I am by no means an expert on Java garbage collection. What follows is a description of my attempts to battle memory- and GC-related issues in our server. If you think I've been especially thick about something, missed some obvious options or went completely off-track, please let me know. Also, a warning: this one's kinda long 🙂

During our latest bout of long-running, high-load endurance tests we identified garbage collection as a major obstacle. The pattern observed after about 2-3 days of running at high load was very frequent full garbage collection cycles, each causing a full stop of the JVM for several seconds. This, of course, led to very slow response times from the affected servers. Over the last week we set out to improve the GC behavior of our application.

The Application

  • A highly concurrent web application running on Tomcat 5.5 and Java 5.0 on multiple app servers.
  • Using Hibernate to access a MySQL database with a distributed EhCache as 2nd level cache.
  • Most requests are very short lived, less than 100ms response time.
  • Some background processing, in-memory caching of frequently referenced objects and some shared objects pools.
  • Running on dedicated dual core servers with 8GB RAM.

Short response times are important to us, so we've been using the low-pause garbage collector (also known as Concurrent Mark-Sweep, or CMS), and it had served us quite well on shorter runs (around 2 days). Things started falling apart after a longer run that did not employ second-level caching. My assumption at the time was that CMS is not particularly efficient in this sort of situation (a large number of short-lived objects and a small number of longer-lived ones), so my first step was to re-enable second-level caching.
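
For context, enabling the collector itself takes a single flag. Here is a minimal sketch of the kind of JVM options we were using, shown as a Tomcat CATALINA_OPTS assignment; the heap size is the 3GB we started with, while -XX:+UseParNewGC and the -Xms/-Xmx pairing are assumptions of mine rather than settings stated in this post:

    # Minimal sketch: low-pause (CMS) collector on Java 5, 3GB heap.
    # Assumptions: -Xms pinned to -Xmx, and -XX:+UseParNewGC (commonly paired with CMS).
    CATALINA_OPTS="-server -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC"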

With the cache enabled we did see an improvement in application behavior, but there was still very clear degradation over time. Most notably, the amount of memory freed in Old at every tenured generation GC cycle was getting smaller and smaller, and the cycles were getting closer together. CMS does most of its work concurrently, except for two points (the initial mark and the remark) during which all mutator threads are paused while the GC marks live objects. Seeing these cycles happen closer and closer together meant that more and more time was lost to GC, and more requests were likely to be delayed for 3-5 seconds while GC took over. My first goal, then, was to figure out why those tenured generation collections were happening so often.

Fortunately, the EhCache page about garbage collection contained some useful information: use -XX:+DisableExplicitGC to disable any libraries calling System.gc() and forcing full GC cycles. There is also a known problem when using RMI in Java (pre-6.0 at least) where RMI causes a full garbage collection every minute by default. To remedy that, set the following parameters to larger values (both are in milliseconds and default to 60000, i.e. 1 minute); a sketch of the combined settings follows the list:

  • -Dsun.rmi.dgc.client.gcInterval
  • -Dsun.rmi.dgc.server.gcInterval
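
Putting those together, roughly what we added to the JVM options. The one-hour (3600000 ms) interval is an illustrative value of my choosing, not a number from the original notes:

    # Suppress explicit System.gc() calls and stretch RMI's forced full-GC interval.
    # The one-hour value below is illustrative, not a recommendation.
    CATALINA_OPTS="$CATALINA_OPTS \
      -XX:+DisableExplicitGC \
      -Dsun.rmi.dgc.client.gcInterval=3600000 \
      -Dsun.rmi.dgc.server.gcInterval=3600000"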

These changes made a noticeable improvement, but tenured GC cycles were still happening more and more often. I started digging around some more and learned that CMS starts collecting the tenured generation whenever the Old generation gets to be around 65-68% full. Now things started to make sense. With our GC becoming less efficient over time (possibly because of a memory leak, possibly not), the occupancy rate of Old was creeping up, so it was hitting that 65% threshold, and triggering tenured collections, more and more often.
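
That threshold is itself tunable. We did not end up changing it, so treat this purely as a pointer; the value shown is illustrative:

    # For reference only: the occupancy level at which CMS kicks off a tenured-generation
    # collection can be set explicitly (70 is an illustrative value).
    CATALINA_OPTS="$CATALINA_OPTS -XX:CMSInitiatingOccupancyFraction=70"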

Now that I had a better understanding of what was going on, I could start looking at some of our GC data (collected using -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintGCTimeStamps and jstat -gccause). The first thing that struck me was that the size of our New generation was way too small for our requirements. It appears that the default values for CMS are heavily weighted toward the Old generation. I started experimenting with -XX:NewSize and -XX:MaxNewSize to explicitly allocate more heap to the New generation. I also started increasing the total heap size.
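
For the record, the data collection amounts to a few logging flags plus jstat pointed at the running process (TOMCAT_PID below is a placeholder for the Tomcat process id):

    # GC logging flags used to collect the data.
    CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
    # Sampling a running JVM once per second (TOMCAT_PID holds the Tomcat process id).
    jstat -gccause -h10 $TOMCAT_PID 1s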

We started at 3GB total heap (on 8GB RAM machines running nothing but Tomcat) and things seemed to run much better at 6GB total heap with 1.5GB allocated to New. At the very least, the additional memory bought us some time before we hit the really Bad Times(tm).
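
In flag form that sizing looks roughly like this; pinning the initial heap to the maximum is my assumption, not something stated above:

    # Roughly the sizing described above: 6GB total heap, 1.5GB of it for the New generation.
    # Pinning -Xms to -Xmx is an assumption on my part.
    CATALINA_OPTS="$CATALINA_OPTS \
      -Xms6144m -Xmx6144m \
      -XX:NewSize=1536m -XX:MaxNewSize=1536m"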

I now started looking at slowing down the rate at which the Old generation was filling up. My initial thought was to try to get fewer objects to tenure at each Young Generation collection. To do that I increased the size of the survivor spaces (-XX:SurvivorRatio=8 to allocate 10% of New to each of the survivor spaces) and started tracking the ages at which objects in New got tenured using -XX:+PrintTenuringDistribution. After some toying around with the parameters I got to see some objects live up to age 2 or 3 (initially the maximum age I could see was 1), but I still could not see much of a change in the rate of increase in Old.
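
The experiment, expressed as flags (a sketch of my reading of the settings above):

    # Eden:S0:S1 = 8:1:1, so each survivor space gets about 10% of New; tenuring ages are
    # printed at every young-generation collection.
    CATALINA_OPTS="$CATALINA_OPTS -XX:SurvivorRatio=8 -XX:+PrintTenuringDistribution"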

We eventually made some sense of it based on the argument that most of our objects are either very short lived (one transaction length, around 100ms) and would get reaped right away, or very long lived (either pooled or cached) and therefore should be tenured as soon as possible. Based on that reasoning and the VoIP-related case study outlined here, we settled on immediate tenuring of all objects that survive one Young Generation GC cycle. To achieve that we set -XX:MaxTenuringThreshold=0 and -XX:SurvivorRatio=1024 (the latter just to save some memory, as survivor space will no longer be used at all).

At this point things settled down a bit and I turned to some fine tuning. Playing around with the size of the New space, it seemed we got a good balance of GC frequency and pause length with about 1024m dedicated to New. This kept Young Generation collections roughly 4-5 seconds apart and Tenured Generation collections between 1.5 and 4 minutes apart, with pauses of about 3-5 seconds depending on load.
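
Putting it all together, the configuration we ended up with looks approximately like this. It is a reconstruction from the notes above rather than a copy of our startup script; the -Xms/-Xmx pairing and the one-hour RMI interval remain my assumptions:

    # Approximate final configuration, reconstructed from the notes above
    # (replaces the incremental sketches shown earlier).
    CATALINA_OPTS="-server \
      -Xms6144m -Xmx6144m \
      -XX:NewSize=1024m -XX:MaxNewSize=1024m \
      -XX:+UseConcMarkSweepGC \
      -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=1024 \
      -XX:+DisableExplicitGC \
      -Dsun.rmi.dgc.client.gcInterval=3600000 \
      -Dsun.rmi.dgc.server.gcInterval=3600000 \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"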

We made a couple more attempts to fight the slow increase in the size of Old.

  • -XX:+UseCMSCompactAtFullCollection to force compaction of memory at GC time, based on the assumption that we might be losing available memory to fragmentation (fragmentation is a known problem with CMS). After running with that setting for a while, it seemed to make absolutely no difference. (See the sketch after this list.)
  • Reducing load on the servers for an extended period of time. We wanted to see if we were running into some inefficiency in Java's GC (at least with the parameters we had and under high load). After increasing the load back to its initial level, we saw the usage level of the Old generation go almost immediately back to where it was before the “rest period” and continue to increase at the same rate.
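
The compaction experiment from the first item, as a flag:

    # Force compaction of the tenured generation at full collections, to rule out
    # fragmentation. In our case it made no observable difference.
    CATALINA_OPTS="$CATALINA_OPTS -XX:+UseCMSCompactAtFullCollection"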

Based on those two experiments our current assumption is that this slow creep is due to a memory leak. We’ll have to confirm that by some deeper profiling of the application.

Related Parameters and Tools

  • -XX:NewSize and -XX:MaxNewSize – Explicitly set the total size of the New Generation (Eden + 2 survivor spaces).
  • -XX:SurvivorRatio – Set the ratio between the size of Eden and each survivor space.
  • -XX:MaxTenuringThreshold – Set the maximum number of young generation collections an object can survive (moving between the survivor spaces) before being tenured.
  • -XX:+DisableExplicitGC – Disable explicit calls to System.gc() which may be hidden away in some library.
  • -Dsun.rmi.dgc.client.gcInterval and -Dsun.rmi.dgc.server.gcInterval – Set the interval (in milliseconds) between the full GC cycles forced by RMI.
  • -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintGCTimeStamps and -XX:+PrintTenuringDistribution – Useful for tracking GC behavior in your application.
  • jstat – when used with -gccause or -gc, another great way to track GC behavior. My favorite command line for jstat:
    • jstat -gccause -h10 <pid> 1s (sample the given Java process once per second, re-printing the column headers every 10 samples)
