Monday, January 5, 2009

Busting java.lang.String.intern() Myths

(Or: String.intern() is dangerous unless you know what you're doing!)

Update: The subject discussed in this article was true up to Java 6. In Java 7, interned strings are no longer allocated in the permanent generation, but in the main Java Heap. I've decided to leave this article here for historical purposes, but keep in mind that the allocation of interned strings in the main heap makes the intern() method an appealing feature to prevent string explosion in the heap.

If you have ever peeked through the Javadoc or the source code of the Java String class, you probably noticed a mysterious native method called intern(). The Javadoc is very concise. It basically says that the method returns a canonical representation of the String that is guaranteed to be unique throughout the JVM. If two String objects are internalized, they can be safely compared with == instead of equals(). This description suggests two reasons to use intern(): comparison becomes faster, and memory usage may improve because you don't fill the heap with lots of equivalent strings.
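To make the uniqueness guarantee concrete, here is a minimal sketch. The class name InternDemo and the helper build() are made up for illustration; only String.intern(), equals() and == come from the JDK.

```java
class InternDemo {
    // Hypothetical helper: builds the string at run time so that each call
    // returns a distinct String object rather than a shared literal.
    static String build(String head, String tail) {
        return new StringBuilder(head).append(tail).toString();
    }

    public static void main(String[] args) {
        String a = build("inter", "ned");
        String b = build("inter", "ned");

        // Two separate objects with the same contents.
        System.out.println(a == b);
        System.out.println(a.equals(b));

        // intern() returns the same canonical instance for equal strings,
        // so the reference comparison succeeds.
        System.out.println(a.intern() == b.intern());
    }
}
```

The first comparison is false and the other two are true, which is exactly the property that makes == tempting once strings are internalized.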

The two reasons above are closer to myth than reality. The performance myth doesn't cause any harm; it is just that the gain is not as big as one would think. But the memory usage improvement myth is where the danger lies: by trying to improve memory usage, one can actually end up causing OutOfMemoryErrors in the application.

Let's look at those myths in more detail.

Myth 1: Comparing strings with == is much faster than with equals()

An industrious developer could think of internalizing strings for performance reasons: you call intern() once, even though it can be a costly operation, and after that you can always compare the strings with ==. What a performance improvement it must be!

I wrote a quick benchmark to compare both approaches. It turns out that comparing strings with an average length of 16 characters using equals() is only about 5 times slower than comparing with ==. Even though a 5 times difference is still a large number, you may be surprised that the gap isn't bigger. There are two reasons for this. First, String.equals() only compares the characters as a last resort: it first checks whether both references point to the same object, then compares the lengths of the strings, which are stored in a separate field. Only if the lengths are the same does it start comparing the characters, and it halts as soon as it finds the first non-matching character.
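The order of checks can be sketched as follows. This is a simplified illustration, not the JDK source: the class EqualsSketch and the method sketchEquals() are hypothetical, and the identity check at the top reflects what the Sun JDK 6 implementation does before it looks at the lengths.

```java
class EqualsSketch {
    // Hypothetical helper that mimics the short-circuiting in String.equals().
    static boolean sketchEquals(String a, String b) {
        // Cheapest check first: the very same object is trivially equal.
        if (a == b) {
            return true;
        }
        if (a == null || b == null) {
            return false;
        }
        // Different lengths can never match, so no characters are read.
        if (a.length() != b.length()) {
            return false;
        }
        // Last resort: compare characters and stop at the first mismatch.
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sketchEquals("hello", "hello world")); // rejected by the length check
        System.out.println(sketchEquals("hello", "jello"));       // rejected at the first character
    }
}
```

In both calls in main() the character loop does almost no work, which is why the full comparison is reached less often than one might expect.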

Another reason for the relatively small difference between the two approaches is that the HotSpot optimizer does a very good job of optimizing method calls, and String.equals() is a very good candidate for inlining since it is a small method that belongs to a final class. That removes any overhead related to method calls.

Now, == provides a 5-fold improvement over equals(). But since String comparison usually represents only a small percentage of the total execution time of an application, the overall gain is much smaller than that, and the final gain will be diluted to a few percent.

So Myth 1: busted! Yes, == is faster than String.equals(), but in general it isn't nearly the performance improvement it is cracked up to be.
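A back-of-the-envelope calculation shows how the dilution works. The 5% share of run time used below is an assumed figure for illustration, not a measurement from the benchmark, and the class and method names are made up.

```java
class SpeedupEstimate {
    // fraction: share of total run time spent comparing strings (assumed)
    // factor:   how many times faster the comparison itself becomes
    // Returns the speedup of the whole program when only that share improves.
    static double overallSpeedup(double fraction, double factor) {
        return 1.0 / ((1.0 - fraction) + fraction / factor);
    }

    public static void main(String[] args) {
        // Assumption: comparisons take 5% of the run time and get 5 times faster.
        System.out.println(overallSpeedup(0.05, 5.0));
    }
}
```

With those assumed numbers the program as a whole becomes roughly 1.04 times faster, that is, about 4%.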

Myth 2: String.intern() saves a lot of memory

This myth is where the danger lies. On one hand, it is true that you can remove String duplicates by internalizing them. The problem is that the internalized strings go to the Permanent Generation, which is an area of the JVM that is reserved for non-user objects, like Classes, Methods and other internal JVM objects. The size of this area is limited, and is usually much smaller than the heap. Calling intern() on a String has the effect of placing its canonical copy in the permanent generation instead of the heap, and you risk running out of PermGen space.

I wrote a small test program that confirms this (see below). The call to Thread.sleep(1000) is there so that you can see the permanent generation going up in a profiler. You can check it yourself by running this program and then running jconsole, which is available in the JDK distribution. Go to jconsole's Memory tab and select Memory Pool "Perm Gen" in the dropdown box. You will see the permanent generation going up steadily until the process terminates with a java.lang.OutOfMemoryError: PermGen space.
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) throws Exception {
        int steps = 1000;
        String base = getBaseString();

        // Holding a reference to every interned string keeps them all reachable.
        List<String> strings = new ArrayList<String>();
        int i = 0;
        while (true) {
            // Each iteration produces a new, unique string of about 1000 characters.
            String str = base + i;
            str = str.intern();
            strings.add(str);
            i++;
            // Pause after every batch so the growth is visible in a profiler.
            if (i % steps == 0) {
                Thread.sleep(1000);
            }

        }
    }

    // Builds a string made of 1000 'a' characters.
    private static String getBaseString() {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            builder.append("a");
        }
        return builder.toString();
    }
}

So Myth 2: busted! String.intern() saves heap space, but at the expense of using up the more precious PermGen space.

Myth 3: internalized strings stay in memory forever

This myth goes in the opposite direction of myth 2. Some people believe that internalized strings stay in memory until the JVM ends. It may have been true a long time ago, but today internalized strings are garbage collected if there are no more references to them. See below a slightly modified version of the program above. It clears the references to internalized strings from time to time. If you follow the program execution from jconsole, you will see that the PermGen space usage goes up and down, as the Garbage Collector reclaims the memory used by the unreferenced internalized strings.
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) throws Exception {
        int steps = 1000;
        String base = getBaseString();

        List<String> strings = new ArrayList<String>();
        int i = 0;
        while (true) {
            // Each iteration produces a new, unique string of about 1000 characters.
            String str = base + i;
            str = str.intern();
            strings.add(str);
            i++;
            // Pause after every batch so the changes are visible in a profiler.
            if (i % steps == 0) {
                Thread.sleep(1000);
            }

            // Every four batches, drop all references so that the interned
            // strings collected so far become eligible for garbage collection.
            if (i % (steps * 4) == 0) {
                strings = new ArrayList<String>();
            }
        }
    }

    // Builds a string made of 1000 'a' characters.
    private static String getBaseString() {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            builder.append("a");
        }
        return builder.toString();
    }
}

Myth 3: Busted! Internalized strings are released if they are no longer referenced.

Note: when == is worth using over equals()

If you are doing heavy text processing you may want to internalize strings. But in this case, you are probably better off using an approach that I outlined here: Weak Object Pools With WeakHashMap. With this approach, you get the benefit of having unique Strings, but without the penalty of using up the PermGen space.
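As a rough idea of what such a pool could look like, here is a minimal sketch built on java.util.WeakHashMap. It is an assumption about the general technique, not the code from the linked article; the class WeakStringPool and the method canonical() are hypothetical names.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

class WeakStringPool {
    // Keys are held weakly by WeakHashMap. The values are wrapped in
    // WeakReference as well, because a strong value pointing at its own key
    // would prevent the entry from ever being collected.
    private final Map<String, WeakReference<String>> pool =
            new WeakHashMap<String, WeakReference<String>>();

    // Returns the pooled instance equal to s, adding s itself if none exists.
    synchronized String canonical(String s) {
        WeakReference<String> ref = pool.get(s);
        String existing = (ref == null) ? null : ref.get();
        if (existing != null) {
            return existing;
        }
        pool.put(s, new WeakReference<String>(s));
        return s;
    }
}
```

Strings returned by canonical() live in the ordinary heap and can be compared with == as long as every string involved went through the same pool. Once nothing else references a pooled string, the garbage collector is free to remove its entry.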

Conclusion: always know what you are doing

As I said in the subtitle of this article, String.intern() is dangerous if you don't know what you are doing. Now you know the risks of using String.intern(), and you will be able to make a more informed decision about whether to use it or not.

More information

  • The OpenJDK source code is the JVM Hacker's heaven. It is not easy reading though, and you need to know C++ to take advantage of what it has to offer.
  • Presenting the Permanent Generation is a very good article about what the PermGen is and what goes inside it.


10 comments:

nanda said...

In case you haven't read it: http://kohlerm.blogspot.com/2009/01/is-javalangstringintern-really-evil.html

FYI, that's not my post :)

William Louth said...

Sorry but I think the "5 times" observation is incorrect as a general statement.

Can you post the actual code? I would love to see what is actually being measured in your micro-benchmark, in particular the strings used and where the deviation happens if some do not match.

I have a cost table beside me for each op (field read, field write, method call,...) and the number you have quoted seems pretty low unless failures in matching are happening very early or you are measuring something else.

William

Gili Nachum said...

Great article.
I agree, I think that the general philosophy should be to optimize later, and only if needed.
Looking into a heap dump of your app, during load, you can see right away what the good candidates for intern are.

sidewords said...

Hi,

I don't agree with your post. I'm working in the area of natural language processing. There we do a lot of data crunching with words and lots of statistics. Let me say that without "intern()" we could go nowhere. It saves our day. Without it, I busted 64 GB of RAM just because each word made its own small nest in memory. When you have applications where you must have vocabularies of hundreds of thousands of words permanently in memory, it helps a lot. As I said, you couldn't do anything without it ...except mapping them to integer ids and going on with the integers instead of the words. And in terms of speedup, it is not fiction but reality, transforming a program that would take days into a program that takes hours.

So, please be cautious in what you say and investigate thoroughly. ...a last word though: yeah, for the classic average application, it's useless.

Matt Quigley said...

I find it strange that you said "== is 5 times faster than String.equals()" and then right after that said that the myth was busted. It's not a myth if it's 5 times faster! When one considers it in O(n) notation, String.equals() is O(n) and == is O(1); this is not a myth, it's fact.

Furthermore, I guarantee you that it's much more than 5 times faster in many applications; depending on the type of strings and how they are used, it can be many orders of magnitude faster that cannot be ignored.

In any case, to your point, if people don't know why they are using .intern() then they most certainly should not be using it.

Anonymous said...

Another reason why == performs only marginally faster than equals() is that String.equals(String) will probably do the == comparison as the first check (e.g. see sun jdk 1.6)

Anonymous said...

Don't forget, Matt Quigley, String.equals short-cuts the comparison with a this==that, then tries a length comparison, then finally tries a == comparison on the char array elements.

William Louth, looking up your cost table for method calls means absolutely nothing when you're running code in a JVM that auto-inlines methods and compiles to native code. Unless you're comparing a large number of identical length strings it would be very rare to enter the loop that iterates the char array.

Anonymous said...

If you are running Java 7, internalized Strings are stored in the heap (assuming you are using the HotSpot JVM). So that removes the problem of running out of PermGen space.

Java 7 release notes

Ranx0r0x said...

In your loop you are constantly interning new strings. Of course that's going to run out of memory. base + i is going to result in the value of base concatenated with the number - "aaa...aa1", "aaa...aa2", etc. You are interning each of those unique strings. The code sample is pretty irrelevant. Try taking the "intern()" out. Do you eventually get an out of heap space error?

There are a number of places that interned strings make sense. If you are pulling data from a database and you have perhaps 100 or 200 companies in all the records, interning them ensures you are not duplicating the strings for each and every record.

This can't be done with public static final String unless you are going to manually enter every company into the class, recompile everything and then redeploy.

A 5* speed up can be quite significant and I'm willing to bet it gets to be much higher than that.

Saurabh Chhajed said...

I have some more details on how string works and how equals and == behaves for interned strings here -

http://saurzcode.wordpress.com/2014/05/19/string-interning-what-why-and-when/