Saturday, January 31, 2009

Positivism and Software Engineering

According to the positivist line of thought, a good scientific theory should be based on a few postulates and infer useful conclusions from those postulates using formal logical or mathematical constructs. It the theory is correct, it should be possible to verify its conclusions empirically. Positivism contrasts with the aristotelian approach, which was to infer the natural laws by pure reasoning instead of empirical observation. It wasn't until Galileo that empirical evidence started to be used to validate scientific theories.

Because of the way positivism works, under this approach it is never possible to prove that a theory is correct. It is only possible to increase the confidence on a theory through empirical observation. The more predictions of the theory are proved by empirical observation, the more confident we get in the theory. But it is possible to prove a theory wrong, by providing one single empirical example of when the theory's conclusions don't hold. If a theory is proven wrong, it becomes necessary to verify the logical inferences for errors, or go back and choose a new set of postulates. Strictly speaking, by this definition Newton physics is incorrect, since some of its predictions are incorrect in edge cases of mass or speed, but it is a good enough approximation for a large class of problems.

Lets take the special theory of relativity as an example of how the positivist approach works. Einstein started from 2 postulates. One was the principle of relativity, that states that the laws of physics have the same form for any frame of reference in uniform motion, enunciated by Galileo in 1639. The other postulate was that the speed of light, not time, is constant. Then, using mathematics, he inferred many useful conclusions that could all be verified empirically later. Positivism puts a strong emphasis on empirical evidence.

Einstein didn't take his postulates out of a hat. Observations had already shown in 1887 that the speed of light was constant, which contradicted the prevailing theory of the Ether. Einstein's great merit was to accept this empirical evidence as truth and use logic and mathematics to calculate what were the implications of this fact.

Well, this blog is not about physics, so how does this relate to software engineering?

For the sake of illustration, we could trace a parallel between the software development process and the scientific method. In software we start with requirements, which would correspond to the postulates of a theory, go through the process of building the software using good engineering practices, that would correspond to inferring conclusions from the postulates through logical inference, and end with the system satisfying the user needs, which would be like having the theory's conclusions being proven by empirical evidence.

Waterfall approaches are like the aristotelian method: after starting with a set of requirements, there is no empirical verification of the validity of the results until the system is put into production. Building a system that doesn't satisfy the user needs through a waterfall approach is the software engineering equivalent of the geocentric model of Aristotle.

Iterative approaches, on the other hand, employ a positivist approach. The process starts with requirements, and the results of combining the requirements to the engineering practices are constantly validated empirically. When the assumptions turn out to be wrong, or when the reasoning represented by the software engineering practices employed turns out to be incorrect, the result is a system that doesn't satisfy the customer needs.

This is why it is important to have frequent feedback, which is obtained by having short iteration cycles. With constant feedback, it is as if the theory gets constantly validated by empirical evidence.

Of course, software engineering has complications that physics doesn't have. As far as we know, the laws of physics don't change, but requirements do. It means that a system that satisfied the requirements in the past may become obsolete. Also, developing a theory from postulates in physics has to be done through strictly formal logic and mathematics. The software engineering equivalent would be to use strictly formal methods for architecture, design and coding. This approach is not possible or desirable in the vast majority of the circumstances because it would add an enormous overhead to the software development process.

Even risking getting into the field of pure speculation, we could extend this parallel between software engineering and positivism further and get to the conclusion that great software, like great scientific theories, require great people with the intelligence, the knowledge, and the insight. Einstein had the intelligence, had the knowledge of the latest scientific breakthroughs of the time and the insight to put it all together and come up with the theory of relativity. To the same measure, James Gosling had the knowledge of developing languages and compilers, the intelligence and the insight of putting some of his best ideas together in the Java platform. Bram Cohen had knowledge of peer-to-peer protocols when he developed BitTorrent.

Just like the western science made a great leap once we stopped using the aristotelian method, software engineering also got a great benefit from adopting a cycle of feedback.

Tuesday, January 20, 2009

A Memory Problem With Java IO

Me and a coworker found something interesting about Java IO the other day.

We were getting an OutOfMemoryError when trying to write the contents of a big byte array to a file. We all hate OutOfMemoryErrors: you never know what really caused it, specially when it happens in a production environment with tens or hundreds of simultaneous threads serving requests at the same time. The process of fixing it usually involves profiling the memory in search for memory leaks, which can be very time consuming.

This OutOfMemoryError was different though. It happened inside FileOutputStream.write(), which is a native call. I happened to have my personal laptop at the office that day, where I have the OpenJDK source code. I have NetBeans 6.5 setup with the OpenJDK projects, which makes navigating through its C++ code a breeze.

My discovery was interesting: when you write a byte array to a file, Java will copy the contents of the byte array into a native array and pass this native array to the native IO function. If the array size is equal or smaller than 8 Kbytes, it will copy into an area in the stack. If the size is greater than 8Kbytes, it will allocate memory in the native heap and use that area instead.

In our case, the byte array was something around 5 MB. Because of some native libraries that we use, we have to live with 32 bits JVMs, and the heap size of our application servers is set to 1024 MB. Along with a 196 MB PermGen size, plus the memory used by those native libraries, it left us with very little room left for the native code. The JVM failed to allocate the 5 MB native array, which was probably caused by heap fragmentation, and threw an OutOfMemoryError from the native side.

The longer term solution will be to switch to a 64 bit JVM, and the mid term solution will be to review the memory configuration of our application servers. In the short term, however, the solution was very simple: instead of writing the big byte array in one go, all we had to do was to write a loop to write in chunks of 8 Kbytes and the problem was gone!

Having the JDK source code at hand saved us a lot of time of guessing and going in the wrong directions in the dark.

Side note: Heap fragmentation is usually not a problem in Java, because the various JVM garbage collectors put a great deal of effort into compacting the heap (the most advanced GCs don't ensure complete compaction but they do their best). C doesn't have managed pointers, so heap fragmentation can become a real problem for C and C++ programs or libraries.

Monday, January 5, 2009

Busting java.lang.String.intern() Myths

(Or: String.intern() is dangerous unless you know what you're doing!)

If you ever peeked through the Javadoc or the source code of the Java String class, you probably noticed a misterious native method called intern(). The javadoc is very concise. It basically says that the method returns a representation of the String that is guaranteed to be unique through the JVM. If two String objects are internalized, they can be safely compared with == instead of equals. This description gives two reasons to use intern(): because comparision becomes faster and because there can be potential memory usage improvements because you wouldn't waste the heap with lots of equivalent strings.

The two reasons above are closer to myth than reality. The performance myth doesn't cause any harm, it is just that the gain is not that big as one would think it would be. But the memory usage improvement myth is where the danger lies: by trying to improve memory usage, one can actually end up causing OOM errors in the application.

Let's look at those myths with more detail.

Myth 1: Comparing strings with == is much faster than with equals()

An industrious developer could think of internalizing strings for performance reasons: you call intern() once, even if though it can be a costly operation, but then after that you can always compare the strings with ==. What a performance improvement it must be!

I wrote a quick benchmark to compare both approaches. It turns out that comparing strings with average length of 16 characters using equals is approximately only 5 times slower than comparing with ==. Even though a 5 times difference is still a large number, you may be surprised that the gap isn't bigger. There are two reasons for this. First, String.equals() only compares the characters as a last effort: it first compares the length of the strings, which is stored in a separate field, and only if the lengths are the same it starts comparing the characters, but it halts as soon as it finds the first non matching character.

Another reason for the relatively small difference between the two approaches is that the HotSpot optimizer does a very good job of optimizing method calls, and String.equals() is a very good candidate for inlining since it is a small method that belongs to a final class. That removes any overhead related to method calls.

Now, == provides a 5-fold improvement over equals(). But since String comparision usually represents only a small percentage of the total execution time of an application, the overall gain is much smaller than that, and the final gain will be diluted to a few percent.

So Myth 1: busted! Yes, == is faster than String.equals(), but in general it isn't near a performance improvement as it is cracked up to be.

Myth 2: String.intern() saves a lot of memory

This myth is where the danger lies. On one hand, it is true that you can remove String duplicates by internalizing them. The problem is that the internalized strings go to the Permanent Generation, which is an area of the JVM that is reserved for non-user objects, like Classes, Methods and other internal JVM objects. The size of this area is limited, and is usually much smaller than the heap. Calling intern() on a String has the effect of moving it out from the heap into the permanent generation, and you risk running out of PermGen space.

I wrote a small test program that confirm this (see below). The call to Thread.sleep(1000) is so that you can see the permanent generation going up in a profiler. You can check it yourself by running this program and then running jconsole which is available in the JDK distribution. Go to jconsole's Memory tab and select Memory Pool "Perm Gen" in the dropdown box. You will see the permanent generation going up steadly until the process terminates with a java.lang.OutOfMemoryError: PermGen space.
import java.util.ArrayList;
import java.util.List;

public class Main {
public static void main(String[] args) throws Exception {
int steps = 1000;
String base = getBaseString();

List strings = new ArrayList();
int i = 0;
while (true) {
String str = base + i;
str = str.intern();
strings.add(str);
i++;
if (i % steps == 0) {
Thread.sleep(1000);
}

}
}

private static String getBaseString() {
StringBuilder builder = new StringBuilder();
for (int i = 0; i < 1000; i++) {
builder.append("a");
}
return builder.toString();
}
}

So Myth 2: busted! String.intern() saves heap space, but at the expense of using up the more precious PermGen space.

Myth 3: internalized strings stay in the memory forever

This myth goes in the opposite direction of myth 2. Some people belive that internalized strings stay in the memory until the JVM ends. It may have been true a long time ago, but today the internalized strings are garbage collected if there are no more references to them. See below a slightly modified version of the program above. It clears the references to internalized strings from time to time. If you follow the program execution from jconsole, you will see that the PermGen space usage goes up and down, as the Garbage Collector reclaims the memory used by the unreferenced internalized strings.
import java.util.ArrayList;
import java.util.List;

public class Main {
public static void main(String[] args) throws Exception {
int steps = 1000;
String base = getBaseString();

List strings = new ArrayList();
int i = 0;
while (true) {
String str = base + i;
str = str.intern();
strings.add(str);
i++;
if (i % steps == 0) {
Thread.sleep(1000);
}

if (i % (steps * 4) == 0) {
strings = new ArrayList();
}
}
}

private static String getBaseString() {
StringBuilder builder = new StringBuilder();
for (int i = 0; i < 1000; i++) {
builder.append("a");
}
return builder.toString();
}
}

Myth 3: Busted! Internalized strings are released if they are no longer referenced.

Note: when == is worth over equals()

If you are doing heavy text processing you may want to internalize strings. But in this case, you are probably better off using an approach that I outlined here: Weak Object Pools With WeakHashMap. With this approach, you get the benefit of having unique Strings, but without the penalty of using up the PermGen space.

Conclusion: always know what you are doing

As I said in the subtitle of this article, String.intern() is dangerous if you don't know what you are doing. Now you know the risks of using String.intern(), and you will be able to make a more informed decision about whether to use it or not.

More information

  • The OpenJDK source code is the JVM Hacker's heaven. It is not easy reading though, and you need to know C++ to take advantage of what it has to offer.
  • Presenting the Permanent Generation is a very good article about what is the PermGen and what goes inside it.