HellOnline


Eran's blog

Trials, Tribulations and Garbage Collection

Note: I am by no means an expert on Java garbage collection. What follows is a description of my attempts to battle memory and GC related issues in our server. If you think I’ve been especially thick about something, missed some obvious options or went completely off-track, please let me know. Also a warning: this one’s kinda long 🙂

During our latest bout of long-running, high-load endurance tests we identified garbage collection as a main obstacle. The pattern we observed after about 2-3 days of running at high load was very frequent full garbage collection cycles, each causing a full stop of the JVM for several seconds. This, of course, led to very slow response times from the affected servers. Over the last week we set out to improve the GC behavior of our application.

The Application

  • A highly concurrent web application running on Tomcat 5.5 and Java 5.0 on multiple app servers.
  • Using Hibernate to access a MySQL database, with a distributed EhCache as the 2nd-level cache.
  • Most requests are very short lived, with less than 100ms response time.
  • Some background processing, in-memory caching of frequently referenced objects and some shared object pools.
  • Running on dedicated dual-core servers with 8GB RAM.


Short response times are important to us, so we’ve been using the low-pause garbage collector (also known as Concurrent Mark & Sweep, or CMS), which had worked quite well for shorter runs (around 2 days). Things started falling apart after a longer run that did not employ second-level caching. My assumption at the time was that CMS is not particularly efficient in this sort of situation (a large number of short-lived objects and a small number of longer-lived ones), so my first step was to re-enable second-level caching.

With the cache enabled we did see an improvement in application behavior, but there was still very clear degradation over time. Most notably, the amount of memory freed in Old at every tenured generation GC cycle was getting smaller and smaller, and the cycles were getting closer together. CMS does most of its work concurrently, except for two points (the initial mark and remark phases) during which all mutator threads are paused while the GC marks live objects. Seeing these cycles happen closer and closer together meant that more and more time was lost to GC, and more requests were likely to be delayed for 3-5 seconds while GC ran. My first goal, then, was to figure out why those tenured generation collections were happening so often.

Fortunately, the EhCache page about garbage collection contained some useful information: use -XX:+DisableExplicitGC to neutralize any libraries calling System.gc() and forcing full GC cycles. There is also a known problem when using RMI in Java (pre 6.0, at least) where RMI causes a full garbage collection every minute by default. To remedy that, set the following parameters to larger values (both are in ms and default to 60000, or 1 minute):

  • -Dsun.rmi.dgc.client.gcInterval
  • -Dsun.rmi.dgc.server.gcInterval
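
For example, wherever you set JAVA_OPTS for Tomcat, the fix could look something like the line below. The one-hour interval is just an illustration, not a value from our config; anything much larger than the 60000ms default helps. Note that -XX:+DisableExplicitGC also neutralizes RMI’s periodic System.gc() calls, so the interval settings mainly matter if you can’t disable explicit GC outright.

  JAVA_OPTS="$JAVA_OPTS -XX:+DisableExplicitGC \
    -Dsun.rmi.dgc.client.gcInterval=3600000 \
    -Dsun.rmi.dgc.server.gcInterval=3600000"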

These changes made a noticeable improvement, but tenured GC cycles were still happening more and more often. I started digging around some more and learned that CMS starts a tenured generation collection whenever the Old generation gets to be around 65-68% full. Now things started to make sense. With our GC becoming less efficient over time (possibly because of a memory leak, possibly not), the occupancy of Old was creeping up over time, so it was hitting that 65% threshold, and triggering GC cycles, more and more often.
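
As an aside, HotSpot lets you set that threshold explicitly instead of relying on the default. We ended up leaving it alone, so treat this as a pointer I ran across rather than something we tested:

  -XX:CMSInitiatingOccupancyFraction=70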

Now that I had a better understanding of what was going on I could start looking at some of our GC data (collected using -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintGCTimeStamps and jstat -gccause). The first thing that struck me was that the size of our New generation was way too small for our requirements. It appears that the default values for CMS are heavily weighted toward the Old generation. I started experimenting with -XX:NewSize and -XX:MaxNewSize to explicitly allocate more heap to the New generation. I also started setting the total heap size larger.
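
The logging flags are cheap to turn on, so if you are fighting similar issues it is worth adding them to your startup line right away, e.g.:

  JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"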

We started at 3GB total heap (on 8GB RAM machines running nothing but Tomcat) and things seemed to run much better at 6GB total heap with 1.5GB allocated to New. At the very least the additional memory bought us some time before we hit the really Bad Times(tm).
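
In flag form that works out to something like the following. Pinning the minimum and maximum sizes together is our habit (it avoids pauses for heap resizing), not something the numbers above require:

  JAVA_OPTS="$JAVA_OPTS -Xms6144m -Xmx6144m -XX:NewSize=1536m -XX:MaxNewSize=1536m"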

I now started looking at slowing down the rate at which the Old generation was filling up. My initial thought was to try to get fewer objects to tenure at each Young Generation collection. To do that I increased the size of the survivor spaces (-XX:SurvivorRatio=8, to allocate 10% of New to each of the survivor spaces) and started tracking the ages at which objects in New got tenured using -XX:+PrintTenuringDistribution. After some toying around with the parameters I got to see some objects live up to age 2 or 3 (initially the maximum age I could see was 1), but I still could not see much of a change in the rate of increase in Old.
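
For reference, the experiment boiled down to these two flags. With SurvivorRatio=8 the New generation is split 8:1:1, i.e. 80% Eden and 10% for each survivor space:

  JAVA_OPTS="$JAVA_OPTS -XX:SurvivorRatio=8 -XX:+PrintTenuringDistribution"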

We eventually made some sense out of it based on the argument that most of our objects are either very short lived (one transaction length, around 100ms) and would get reaped right away, or very long lived (either pooled or cached) and therefore should be tenured as soon as possible. Based on that reasoning and the VOIP-related case study outlined here, we settled on immediate tenuring of all objects that survive one Young Generation GC cycle. To achieve that we set -XX:MaxTenuringThreshold=0 and -XX:SurvivorRatio=1024 (the latter just to save some memory, as survivor space will no longer be used at all).

At this point things settled down a bit and I turned to some fine tuning. Playing around with the size of the New space, it seemed that we got a good balance of GC frequency and pause length with about 1024m dedicated to New. This kept Young Generation GCs to around one every 4-5 seconds and Tenured Generation collections to one every 1.5 to 4 minutes, with pauses of about 3-5s depending on load.
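
Putting it all together, the GC-related part of our startup line at this point looked roughly like this. I reconstructed it from the settings described above rather than copying it from our config, so treat the exact values as illustrative:

  JAVA_OPTS="-Xms6144m -Xmx6144m \
    -XX:NewSize=1024m -XX:MaxNewSize=1024m \
    -XX:+UseConcMarkSweepGC \
    -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=1024 \
    -XX:+DisableExplicitGC \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"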

We made a couple more attempts to fight the slow increase in the size of Old:

  • -XX:+UseCMSCompactAtFullCollection to force compaction of memory at collection time, based on the assumption that we might be losing available memory to fragmentation (fragmentation is a known problem with CMS). After running with that setting for a while it seemed to make absolutely no difference.
  • Reducing load on the servers for an extended period of time. We wanted to see if we were encountering some inefficiency in Java’s GC (at least with the parameters we use and under high load). After increasing the load back to its initial level, we saw the usage level of the Old generation go almost immediately back to where it was before the “rest period” and continue to increase at the same rate.

Based on those two experiments, our current assumption is that this slow creep is due to a memory leak. We’ll have to confirm that with some deeper profiling of the application.
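
When we do, a simple first step (assuming the JDK’s jmap tool is available on the servers; its options vary between JDK versions) would be to take a couple of heap histograms a few hours apart under load and compare the per-class instance counts:

  jmap -histo <pid>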

Related Parameters and Tools

  • -XX:NewSize and -XX:MaxNewSize – Explicitly set the total size of the New Generation (Eden + 2 survivor spaces).
  • -XX:SurvivorRatio – Set the ratio between the size of Eden and a single survivor space.
  • -XX:MaxTenuringThreshold – Set the maximum number of Young Generation collections an object can survive (moving between the survivor spaces) before being tenured.
  • -XX:+DisableExplicitGC – Disable explicit calls to System.gc() which may be hidden away in some library.
  • -Dsun.rmi.dgc.client.gcInterval and -Dsun.rmi.dgc.server.gcInterval – Set the interval (in ms) between the full GC cycles forced by RMI.
  • -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintGCTimeStamps and -XX:+PrintTenuringDistribution – Useful for tracking GC behavior in your application.
  • jstat – when used with -gccause or -gc, another great way to track GC behavior. My favorite command line for jstat:
    • jstat -gccause -h10 <pid> 1s



Filed under: Java

Being Evil Online Doesn’t Pay

At least not in the long run. In most cases. Well… here’s a glimmer of hope anyway.

Earlier today, Facebook sent warnings to developers whose apps forced users to invite friends in order to use the app, turning off invitations altogether for those apps until that behavior was corrected. This was a major user-experience problem that now should exist no more.

I’ve been having this back and forth discussion lately about Send Hotness, just one of many Facebook apps that force you to invite X many friends before you get to use them. I wasn’t surprised (although somewhat disappointed) to see the app take off in a viral storm and reach hundreds of thousands of sign-ups or more. Despite its apparent success, I maintained that in the long run this kind of behavior would not pay off. Irritating your users is not good policy, especially if you’re trying to build an enduring brand. I’m glad to see Facebook cracking down on overly aggressive applications. In the end this kind of policy will prove beneficial to everyone involved.

Here’s to less SPAM!

Filed under: Social Software

Fight Piracy with Innovation, not DRM

We’ve all been saying this to the music and movie industries for quite a while. More DRM does not translate to more sales, Innovation does. It seems that the gaming industry is now learning a similar lesson:

As we believe that we are decreasing the number of pirates downloading the game with our DRM fixes, combining the increased sales number together with the decreased downloads, we find 1 additional sale for every 1,000 less pirated downloads. Put another way, for every 1,000 pirated copies we eliminated, we created 1 additional sale.

Though many of the pirates may simply be shifting to another source of games for their illegal activities, the number is nonetheless striking and poignant. The sales-to-download ratio found on Reflexive implies that a pirated copy is more similar to the loss of a download (a poorly converting one!) than to the loss of a sale.

It’s good to see another company learning that the standard rhetoric about piracy equaling lost revenue is almost completely false. Most people who pirate your product would probably never have bought it in the first place. So why ruin your paying users’ experience with more limitations? Innovate instead.

In Korea, game piracy has long been a problem. Gaming companies in that market have turned mainly to two other business models:

  1. Subscription. Works great for WoW! Moving the authentication and authorization work over to the server makes pirating your content much harder. It also gives you a steady revenue with which to keep developing new content, getting your users to stick around longer, pay you more money, etc. Unfortunately, this method did not seem to work very well in the Korean game market, which is why many companies have switched to the second model:
  2. Free-to-play games with micro-transactions. Download the game for free, send it to your friends, install it on as many machines as you’d like, then come check out our in-game market; it has some nice power-boosts and unique items that you can’t get anywhere else. This model has been fantastically successful in Korea and is now slowly infiltrating US markets as well.

What’s next? It seems like casual gaming inside social networks (read: Facebook) is set to grow, but with a business model that is so far unclear (ads? microtransactions?). That’s fun, but here’s what I’m waiting for: with Guitar Hero and Rock Band sales going through the roof, I’d love to see the next music CD I buy come with a bonus Guitar Hero or Rock Band track so I can Rawk Out with my friends to some cool new music. That’s the kind of innovation that would actually get me to buy CDs again.

Filed under: gaming