Java Permgen space, String.intern, XML parsing

Posted by haakon, Sat, 09 Sep 2006 08:16:00 GMT

This week I have been poking through the innards of a web application trying to find out why we were leaking memory (in the permanent generation) like crazy. After a bit of digging I isolated it down to a line that looked like this:

Document doc = SAXParser.new().parse( stringContainingXML );

My first inclination was to blame the parser. Everyone knows that XML parsers are troublemakers, right? In the end, though, I had to conclude the leak was entirely the fault of our code. But I learned a bit along the way! The details:

Permgen space – what is it?

The memory a JVM uses is split up into three “generations”: young (which includes eden), tenured, and permanent. This is done to improve the performance of garbage collection. Most objects are short lived (temporaries created inside a method, etc.), and so they come and go in the young generation. Some objects (like things in caches) stick around for a while and get promoted from the young to the tenured generation. Some things live “forever”, like the classes themselves, and “interned” strings. These go straight into the permanent generation.

Most memory leaks involve normal objects, and you run out of heap space by filling up the young and tenured memory spaces. Sometimes, though, you might see the failure “java.lang.OutOfMemoryError: PermGen space”. The most common cause is that you simply don’t have enough space to load up all your classes. Use the param ‘-XX:MaxPermSize=100m’ to adjust to a desired value. You may also find that doing a hot deploy of a WAR into Tomcat eventually uses up permgen space. That is a different issue which I won’t discuss here.

If you observe that your app is leaking permgen space just while it is running (and not because you are hot deploying a war), then you have an interesting problem. The issue is most likely to be either an errant ClassLoader, or String.intern gone awry. ClassLoaders are an interesting beast, but our problem was with interned strings.

What is String.intern?

String.intern is an optimization feature. Doing a double equals (==) compare of two strings is a common mistake people make, as they forget that this is doing an identity comparison. (a == b) is checking if a and b are in fact the same object. Usually, what you really want to do is check if (a.equals(b)). This does the character by character comparison that you probably want.
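A minimal sketch of the difference, in case it helps. The class and method names here are made up for illustration; `new String(...)` is used only to force two distinct objects with the same characters.

```java
class StringCompareDemo {
    /** Identity comparison: true only if both references point to the same object. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    /** Value comparison: true if both strings contain the same characters. */
    static boolean sameCharacters(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String a = new String("permgen");  // explicitly allocates a new object
        String b = new String("permgen");  // a second, distinct object

        System.out.println(sameObject(a, b));      // false: two different objects
        System.out.println(sameCharacters(a, b));  // true: same characters
    }
}
```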

The catch is that equals is much slower than an identity comparison. So, a nice performance optimization can be to maintain a canonical list of strings that allows you to do the fast identity comparisons instead. It would be easy enough to write such a thing for yourself, but Java already includes it as the String.intern method. Java maintains a pool of these “canonical” strings to let you get better performance when dealing with strings. But this pool lives in the permgen space!
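Here is a rough sketch of both ideas side by side: the built-in String.intern, and a hand-rolled canonical pool of the "write it yourself" kind. The `CanonicalStrings` class is hypothetical and exists only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class CanonicalStrings {
    // Hypothetical hand-rolled pool: maps each distinct string value to one canonical instance.
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the canonical instance for s, adding s to the pool on first sight. */
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently held by the pool. */
    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        String a = new String("element");
        String b = new String("element");

        // Built-in pool: intern() returns the same instance for equal strings.
        System.out.println(a.intern() == b.intern());  // true

        // Hand-rolled pool: same effect, but the map lives on the ordinary heap.
        CanonicalStrings mine = new CanonicalStrings();
        System.out.println(mine.canonical(a) == mine.canonical(b));  // true
        System.out.println(mine.size());  // 1
    }
}
```

Note that this toy pool never removes anything, so it grows with every new string it sees. That is exactly the failure mode this post is about, just in a different memory space.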

Why not intern all strings?

A natural question might be why one shouldn’t just intern every string. There are two reasons why this wouldn’t work. First, you have finite memory. If you stored every string you ever saw into permgen space with intern, you would run out of memory reasonably quickly. Second, the reason you are using intern in the first place is as a performance optimization: once two strings are interned, comparing them by identity is faster than comparing them character by character. However, as the intern pool grows without bound, the cost of looking a string up in the pool would probably eventually exceed the cost of just doing the character comparison. So, you only want to intern strings which you use frequently throughout the life of your app.

XML parsers seem to use String.intern (or something similar)

XML parsing just happens to be a whole lot of string parsing. So, it is not surprising to find that parsers take advantage of intern. But, we just said that you probably don’t want to intern every string you see, so what does a parser like Xerces intern? According to (http://xerces.apache.org/xerces2-j/features.html), “All element names, prefixes, attribute names, namespace URIs, and local names are internalized using the java.lang.String#intern(String):String method”. These are all the strings that are going to be seen repeatedly when parsing multiple XML documents with the same DTD. Notice that they don’t intern attribute values and tag contents. These are what change from document to document; they are your actual data. To intern these would be to intern your entire data space, and we would be facing the previously mentioned problem of effectively interning all strings.
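If you want to see what your own parser says about this, SAX2 defines a standard feature id for it, http://xml.org/sax/features/string-interning. The sketch below just asks whatever SAX parser the JVM hands out by default; the class name is made up, and what comes back will depend on the parser you actually have on your classpath.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class InterningFeatureCheck {
    // Standard SAX2 feature identifier for name interning.
    static final String STRING_INTERNING = "http://xml.org/sax/features/string-interning";

    /** Describes how the JVM's default SAX parser reports the feature. */
    static String describe() {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return "string-interning=" + reader.getFeature(STRING_INTERNING);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // Some parsers may not expose the feature at all.
            return "string-interning not reported: " + e.getMessage();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```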

Our problem

At last we arrive at our problem. We were parsing XML documents and finding that our permgen was steadily growing. At first we just enlarged permgen, assuming we had a lot of classes to load. But when we were blowing up with 500 megs of permgen space used up, it was time to find the problem.

After a bunch of digging, what we found was this. The XML we were parsing was not really XML. It was well formed (tags opened and closed properly, nested properly, etc). But, it was XML for which it would be impossible to write a DTD because the data lived in the tag space. An example will show it best. We had tags that looked like:

<data.6541237895.field1>field one val</data.6541237895.field1>
<data.6541237895.field2>field two val</data.6541237895.field2>
<data.7813329781.field1>field one val</data.7813329781.field1>
<data.7813329781.field2>field two val</data.7813329781.field2>
...

The numbers inside of the tag itself were data! So, there was no limited, finite number of tags that could exist in an XML document of this form. Rather, you could have as many tags as could be represented by a ten digit number. To make it worse, there were different values like “foobar” and “name” and many others for each number. The details are boring, but the important bit was that our tag space was as big as our data space. The XML parser was merrily interning every tag string it saw as a reasonable performance optimization. But, as our XML was not true XML, everything came crashing down.
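One way to see the problem for yourself is to count how many distinct element names a parser reports for a document. This is only a sketch: the class name and the `<records>` wrapper element are my own additions (the wrapper is there so the sample has a single root), and it uses whatever SAX parser the JVM provides by default.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DistinctTagCounter {
    /** Parses the XML and returns every distinct element name the parser reported. */
    static Set<String> distinctElementNames(String xml) {
        Set<String> names = new TreeSet<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // The <records> wrapper is an assumption added so the sample is well formed.
        String xml = "<records>"
                + "<data.6541237895.field1>field one val</data.6541237895.field1>"
                + "<data.6541237895.field2>field two val</data.6541237895.field2>"
                + "<data.7813329781.field1>field one val</data.7813329781.field1>"
                + "<data.7813329781.field2>field two val</data.7813329781.field2>"
                + "</records>";
        // Four data fields produce four distinct names, plus one for the wrapper.
        System.out.println(distinctElementNames(xml).size());  // 5
    }
}
```

With sane XML this number stays flat no matter how many records you parse. With tags like ours it grows with every new record, and each of those names is a candidate for interning.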

So how to fix it?

  1. Maybe the best solution would be to fix the bad XML. In this case, we were not the source of the XML so this ideal option was not practical.
  2. Supposedly one can turn off the “feature” of interning via the SAX parser interface in Java. In practice, none of the parsers we tried allowed us to turn it off (let me know if you find one that does!).
  3. It would be nice if the interned strings could just be garbage collected like any other Java memory. I’ve seen conflicting reports on this. This article appears to show that interned strings can be collected.
  4. Don’t use an XML parser if you aren’t really parsing XML.

Number 4 may seem like a cop-out, but it is the option we landed on. We now use a few regular expressions to pull the data we need from the “XML” document. This happens to both fix our memory problem and result in a performance improvement. Apparently selecting just the parts of the document we need with a regex is faster than parsing the whole thing with an XML parser.
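To give a flavor of the regex approach, here is a sketch against the sample tags shown earlier. It is not our production code: the class name and the pattern are made up for this post, and the pattern assumes flat tags with plain text content (no attributes, nested elements, entities, or CDATA).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFieldExtractor {
    // Groups: 1 = numeric id, 2 = field name, 3 = text content.
    // The backreferences \1 and \2 require the closing tag to match the opening one.
    private static final Pattern FIELD =
            Pattern.compile("<data\\.(\\d+)\\.(\\w+)>([^<]*)</data\\.\\1\\.\\2>");

    /** Returns one {id, field, value} triple per matching tag pair. */
    static List<String[]> extract(String doc) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = FIELD.matcher(doc);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return rows;
    }

    public static void main(String[] args) {
        String doc = "<data.6541237895.field1>field one val</data.6541237895.field1>\n"
                + "<data.6541237895.field2>field two val</data.6541237895.field2>\n";
        for (String[] row : extract(doc)) {
            System.out.println(row[0] + " " + row[1] + " = " + row[2]);
        }
    }
}
```

The ids and field names end up as ordinary heap strings here, so they get garbage collected like everything else once you are done with them.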

How to find these problems

Tracking down these problems can be challenging:

  1. Profiling is your friend. Find a good profiler and learn how to use it (JProfiler works nicely).
  2. jmap and jstat are useful tools that come with the JDK. They give you info about memory usage, etc.
  3. visualgc (jvmstat) is a nice tool for seeing an overall picture of your memory usage.
  4. Understand how garbage collection works
    (http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html)
  5. Get familiar with JVM args that help with this kind of debugging and performance optimization: verbose gc logging, tracing of class loading, etc.
    (http://java.sun.com/docs/hotspot/gc1.4.2/faq.html, http://www.brokenbuild.com/blog/2006/08/04/java-jvm-gc-permgen-and-memory-options/)
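If you would rather watch the numbers from inside the app, the java.lang.management API (available since Java 5) exposes the JVM's memory pools. The sketch below lists the non-heap pools with their current usage. The class name is made up, and the pool names you see depend on your JVM and collector, so treat the output as something to explore rather than something to rely on.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class NonHeapPools {
    /** Returns one line per non-heap memory pool with its used and max bytes. */
    static List<String> report() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.NON_HEAP) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();  // null if the pool is no longer valid
            if (usage == null) {
                continue;
            }
            // getMax() returns -1 when the maximum is undefined.
            lines.add(pool.getName() + ": used=" + usage.getUsed() + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : report()) {
            System.out.println(line);
        }
    }
}
```

Logging this every few minutes and watching for a pool that only ever goes up is a cheap way to confirm a leak before reaching for a profiler.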

~haakon