Wednesday, June 3, 2009

The .NET Framework Assistant Hysteria

Yes, I know this is old news, but I had this post as a draft for a while and never got around to finishing it, so here it is now.

A couple of weeks ago there was major hysteria around the blogging world because of a Microsoft .NET update that installed a Firefox add-on called .NET Framework Assistant. The outrage was that the add-on was installed without the user's knowledge when he or she downloaded an update to the .NET Framework.

What does this add-on do? It lets Firefox run .NET ClickOnce applications, which are the equivalent of Java Web Start applications. Basically, these technologies allow the user to click on a link and download an application that then runs on the user's machine. This is not a new concept and has been around for ages.

Now, I don't remember seeing any outrage over the fact that once you install Java it becomes possible to run Web Start applications from Firefox. But just because it is Microsoft doing it, we see ridiculous claims like "Is Microsoft Sabotaging Firefox With Sneaky .NET Updates?". And the worst part is that non-technical people believe this crap and perpetuate the hysteria.

As I get older I have less and less patience for this kind of fanboyism. The world has come a long way since the days when Microsoft was seen as the evil force that had a monopoly on our digital lives.

Thursday, May 21, 2009

Stupid Balloon Tips


One of the most dreaded UI elements, from my point of view, is the annoying balloon notification that shows up in the system tray. The problem is not the UI solution itself: it is actually useful and much better than throwing a popup window in your face. After all, who has never hit Enter by mistake in a popup window that showed up a split second before, without even knowing what it said?

The balloon notification is much less intrusive. This 'quality', however, ends up causing it to be extremely overused. I recently got an Eee PC with Windows XP preinstalled, and I found what is the most useless balloon notification I have ever seen. On the Eee PC 1000HE, whenever I plug in headphones, I get the notification above. As if I needed the computer to tell me what I had just done!

Balloon notifications are distractions, just like those tiny IM or email notifications that scroll up from the system tray. As a developer, you should avoid those distractions as much as possible. Remember that the user may be doing other things on the computer. The purpose of your software is to help get things done, not to get in the way of the user's work.

Wednesday, May 20, 2009

Quote of the day

From Tom Kyte's Expert Oracle Database Architecture:

"There are no silver bullets -- none. If there were, they would be the default behavior and you would never hear about them".

Sunday, May 3, 2009

Google Tech Talks

Google Tech Talks is an excellent YouTube channel with full-length tech presentations given at Google. There are talks on a wide variety of topics that appeal to developers. A few recent examples are "Twitter WTF? - Why is Twitter Called a Threat to Google?" by Laura Fitton, "Learning from StackOverflow.com" by Joel Spolsky, and "Compiling and Optimizing Scripting Languages".

I strongly recommend that you subscribe to their RSS feed and watch out for presentations on topics that interest you. If you can't be at conferences all the time, this is the next best thing -- except that you miss the coffee breaks and the networking :)

What other technical presentation links do you guys suggest?

Friday, May 1, 2009

The Agile Fallacy

Or: Agile didn't invent iterative development



12 years ago, when I was taking computer science classes in college, our teachers used to tell us that waterfall processes were bad, and that iterative processes were a Good Thing. They told us we should do a little bit of requirements, then a little bit of design, then a little bit of development and a little bit of testing. Then you would evaluate the results, review your planning and start over. There was even this fancy little spiral to illustrate the concept:



The concept of iterative development wasn't new by then; it existed way before people started talking about Agile methods. But if you read what many evangelists and bloggers have to say, you would think that Agile 'invented' the concept. You read things like "in waterfall, this and that happens, but Agile solves this problem by doing this other thing..." all over the place. People seem to conveniently pretend that if you are not doing Agile, you are doing waterfall. This is what I call the Agile fallacy: the false dichotomy between Agile and waterfall, used as an argument to justify Agile.

As Penn & Teller would put it, that's BULLSHIT. Don't buy into this crap, folks. That's not to say that Agile is bad. Far from it. It is a Good Thing, and a huge step forward in the field of software development. It brings in fresh new ideas that will still be around even when nobody is talking about Agile anymore. I'm also not saying that Agile is just iterative development with a fancy name; that is an oversimplification made only by people who don't understand the existing Agile methods. But Agile methods are not the only alternative to waterfall processes, and the alternatives existed before Agile. Even RUP, the overly complex process framework that nobody seems to talk about anymore, and which for many people is a synonym for waterfall, is actually iterative.

Now, why is the Agile fallacy perpetuated? There are two kinds of people who defend it. The first group is what I call the 25-year-old specialists. These are young people, fresh out of college, who are articulate and smart. They are well intentioned, but they have probably never seen anything other than Agile in their careers, so they just inadvertently perpetuate the lies spread by the second kind of people, the Unethical Agile Salesmen. These are the ones who lie to make their sales pitch look better, just like a used car salesman will lie to you about the defects and quality of a used car to make it look like a hot offer. It is important to note, though, that I don't want to make generalizations. Like any other field, Agile has ethical and unethical professionals, and there are some very credible people in it.

Why am I bitching about the Agile fallacy so much? The first problem I see with it is that it takes away the credibility of the people who defend Agile, and I think that is one of the biggest problems of the field today. There are just too many Agile snake oil salesmen around.

Another problem is that in the world of software development there has always been a hot idea du jour. When it arrives, it seems to be the Holy Grail of computing, the thing that came to solve all our problems. People embrace it and bet their success on it. Then, after some time, people realize the real qualities and flaws of the new idea, and their disappointment is proportional to the passion they had when they first embraced it, so they dismiss the now not-so-new idea completely, even though it has some legitimate good points. Agile has been the Holy Grail for some time now, and it won't be long until people realize that it is not perfect. To a certain extent, this is already happening, with people saying that Lean software development is the new Hot Idea that came to replace Agile. This is also a false dichotomy, but that is a subject for another article. What happens is that after the period of passion people will notice the fallacy, and this time it will work against Agile.

To avoid this trap, always be skeptical. If you take a skeptical point of view from the beginning, you won't be disappointed later. Better yet, you will be able to enjoy the real benefits of Agile.

Sunday, April 26, 2009

Read-Only transactions with Spring and Hibernate

Spring supports the concept of read-only transactions. Because Spring doesn't provide persistence functionality itself, the semantics of read-only transactions depend on the underlying persistence framework used along with Spring.

I use Spring with Hibernate and Oracle, and when I tried to understand the semantics of read-only transactions in this specific configuration, I found that there was very little information on the web. The information that does exist is scarce and not very clear, so I had to do some research myself, which included digging into Spring's and Hibernate's source code. Not that I don't enjoy spending a few late hours reading good code, but so that you don't have to do it yourself, here is what I found.

Spring

Spring's documentation says almost nothing about what a read-only transaction really means. The best information I could find was:

Read-only status: a read-only transaction does not modify any data. Read-only transactions can be a useful optimization in some cases (such as when using Hibernate).

That's basically all it says. Google and a little hacking shed some light on the real meaning of the sentence above: if the transaction is marked as read-only, Spring will set the Hibernate Session's flush mode to FLUSH_NEVER and will set the JDBC transaction to read-only. Now let's understand what that means and what the implications are in a Hibernate/Oracle setup.
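
For context, here is roughly how a read-only transaction is usually declared with Spring's annotation support (a minimal sketch; the ProductDao class and the "from Product" query are made up for illustration, and the same can be configured in XML instead of annotations):
import java.util.List;

import org.hibernate.SessionFactory;
import org.springframework.transaction.annotation.Transactional;

public class ProductDao {

    private SessionFactory sessionFactory;

    public void setSessionFactory(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // readOnly = true is the flag that triggers the two behaviors discussed
    // here: flush mode FLUSH_NEVER on the Hibernate Session and
    // Connection.setReadOnly(true) on the underlying JDBC connection.
    @Transactional(readOnly = true)
    public List<?> findAllProducts() {
        return sessionFactory.getCurrentSession()
                             .createQuery("from Product")
                             .list();
    }
}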

Hibernate

Hibernate doesn't have the concept of a read-only session. But when a session's flush mode is set to FLUSH_NEVER, which is what Spring does, two interesting things happen. First, running HQL queries no longer causes Hibernate to flush the session state to the database, which can provide a dramatic performance improvement. Second, Hibernate will not flush the changes before committing the transaction. But the user can still call Session.flush() by hand, causing any modifications to be persisted to the database. This is where Spring's call to Connection.setReadOnly() comes in handy.
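
To make this concrete, here is a minimal sketch of what Spring effectively does to the Session, and of the loophole that an explicit flush() would leave open (the Product entity in the HQL is hypothetical):
import org.hibernate.FlushMode;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

public class ReadOnlyFlushExample {

    // A sketch of what Spring does to the Session for a read-only transaction.
    public static void runReadOnlyWork(SessionFactory sessionFactory) {
        Session session = sessionFactory.openSession();
        try {
            // FLUSH_NEVER: queries no longer trigger an automatic flush,
            // and nothing is flushed at commit time.
            session.setFlushMode(FlushMode.NEVER);

            // Run queries, navigate lazy associations, etc. Any changes made
            // to loaded entities stay in memory only...
            session.createQuery("from Product").list(); // Product: hypothetical entity

            // ...unless someone calls session.flush() explicitly, which would
            // still issue the SQL. That is the loophole the JDBC read-only
            // flag (Connection.setReadOnly(true)) closes.
        } finally {
            session.close();
        }
    }
}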

Oracle

When using the Oracle JDBC driver, calling connection.setReadOnly(true) translates into the statement "SET TRANSACTION READ ONLY". This statement limits the types of SQL statements that can be executed during the transaction. Only SELECTs (without 'FOR UPDATE') and a few other statements can be executed; specifically, no UPDATEs, DELETEs, INSERTs or MERGEs are allowed. This behavior is Oracle-specific: other RDBMSs can have different semantics for read-only transactions or simply not support them at all.
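
In plain JDBC terms, the behavior looks roughly like this (a sketch; the products table is made up, and the UPDATE fails with an Oracle error because the transaction is read-only):
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

import javax.sql.DataSource;

public class ReadOnlyJdbcExample {

    public static void demo(DataSource dataSource) throws SQLException {
        Connection connection = dataSource.getConnection();
        try {
            connection.setAutoCommit(false);
            // With the Oracle driver this issues "SET TRANSACTION READ ONLY".
            connection.setReadOnly(true);

            Statement statement = connection.createStatement();
            // Queries are fine.
            statement.executeQuery("SELECT * FROM products");
            // Any DML raises an error, since the transaction is read-only.
            statement.executeUpdate("UPDATE products SET name = 'x'");
        } finally {
            connection.rollback();
            connection.close();
        }
    }
}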

By setting the JDBC connection to read-only, Spring prevents a distracted user from persisting changes by flushing the Hibernate session to the database.

Notes

As we saw, with the two measures taken by Spring, the transaction is guaranteed to be read-only through the JDBC connection, and performance improvements are obtained by setting the Hibernate session to FLUSH_NEVER.

There is one thing that doesn't happen, though. Even during Spring read-only transactions, Hibernate queries still save the state of persistent objects in the session cache. In theory this wouldn't be necessary, since that state is only used to detect modifications during session flushes. Depending on the size and number of objects, this can make a huge difference in terms of memory usage.

If you still want to prevent Hibernate from saving the object state in the session cache, you have to manually run your HQL queries in read-only mode. It would be a nice improvement to Hibernate to have a read-only mode for the whole session, so that no object state is stored and no flushes are executed.
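
For individual queries this is already possible through Hibernate's Query.setReadOnly(); a minimal sketch (again with a hypothetical Product entity):
import java.util.List;

import org.hibernate.Session;
import org.hibernate.SessionFactory;

public class ReadOnlyQueryExample {

    public static List<?> loadProducts(SessionFactory sessionFactory) {
        Session session = sessionFactory.getCurrentSession();
        // setReadOnly(true) tells Hibernate not to keep the snapshot used
        // for dirty checking, so the entities loaded by this query use less
        // memory and are never considered for automatic updates.
        return session.createQuery("from Product")
                      .setReadOnly(true)
                      .list();
    }
}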

Tuesday, February 3, 2009

Understanding the Retained Size

I have used two excellent memory analysis tools that provide an important metric called Retained Size. Those tools are the YourKit Java Profiler, a commercial tool worth every penny of its price, and the Eclipse Memory Analyzer, an excellent open source Java heap analyzer. If all you need is offline analysis of heap dumps, the Eclipse Memory Analyzer is everything you will ever need.

Markus Kohler, an architect at SAP who worked on the Eclipse Memory Analyzer, has just published an excellent article explaining the definition of retained size and how to calculate it.

Saturday, January 31, 2009

Positivism and Software Engineering

According to the positivist line of thought, a good scientific theory should be based on a few postulates and infer useful conclusions from those postulates using formal logical or mathematical constructs. If the theory is correct, it should be possible to verify its conclusions empirically. Positivism contrasts with the Aristotelian approach, which was to infer the natural laws by pure reasoning instead of empirical observation. It wasn't until Galileo that empirical evidence started to be used to validate scientific theories.

Because of the way positivism works, under this approach it is never possible to prove that a theory is correct. It is only possible to increase the confidence in a theory through empirical observation: the more of the theory's predictions are confirmed empirically, the more confident we get in the theory. But it is possible to prove a theory wrong, by providing one single empirical example where the theory's conclusions don't hold. If a theory is proven wrong, it becomes necessary to check the logical inferences for errors, or to go back and choose a new set of postulates. Strictly speaking, by this definition Newtonian physics is incorrect, since some of its predictions fail in edge cases of mass or speed, but it is a good enough approximation for a large class of problems.

Let's take the special theory of relativity as an example of how the positivist approach works. Einstein started from two postulates. One was the principle of relativity, enunciated by Galileo in 1639, which states that the laws of physics have the same form in any frame of reference in uniform motion. The other postulate was that the speed of light, not time, is constant. Then, using mathematics, he inferred many useful conclusions that could all be verified empirically later. Positivism puts a strong emphasis on empirical evidence.

Einstein didn't take his postulates out of a hat. Observations had already shown in 1887 that the speed of light was constant, which contradicted the prevailing theory of the Ether. Einstein's great merit was to accept this empirical evidence as truth and use logic and mathematics to work out the implications of this fact.

Well, this blog is not about physics, so how does this relate to software engineering?

For the sake of illustration, we can draw a parallel between the software development process and the scientific method. In software we start with requirements, which correspond to the postulates of a theory; we go through the process of building the software using good engineering practices, which corresponds to inferring conclusions from the postulates through logical inference; and we end with the system satisfying the user's needs, which is like having the theory's conclusions confirmed by empirical evidence.

Waterfall approaches are like the Aristotelian method: after starting with a set of requirements, there is no empirical verification of the validity of the results until the system is put into production. Building a system that doesn't satisfy the user's needs through a waterfall approach is the software engineering equivalent of Aristotle's geocentric model.

Iterative approaches, on the other hand, employ a positivist approach. The process starts with requirements, and the results of combining the requirements with the engineering practices are constantly validated empirically. When the assumptions turn out to be wrong, or when the reasoning embodied in the software engineering practices turns out to be incorrect, that constant validation exposes it before the result is a system that doesn't satisfy the customer's needs.

This is why it is important to have frequent feedback, which is obtained by having short iteration cycles. With constant feedback, it is as if the theory gets constantly validated by empirical evidence.

Of course, software engineering has complications that physics doesn't have. As far as we know, the laws of physics don't change, but requirements do, which means that a system that satisfied the requirements in the past may become obsolete. Also, developing a theory from postulates in physics has to be done through strictly formal logic and mathematics. The software engineering equivalent would be to use strictly formal methods for architecture, design and coding, an approach that is neither possible nor desirable in the vast majority of circumstances because it would add an enormous overhead to the software development process.

Even at the risk of getting into pure speculation, we could extend this parallel between software engineering and positivism further and conclude that great software, like great scientific theories, requires great people with intelligence, knowledge, and insight. Einstein had the intelligence, the knowledge of the latest scientific breakthroughs of his time, and the insight to put it all together and come up with the theory of relativity. In the same way, James Gosling had the knowledge of languages and compilers, the intelligence, and the insight to put some of his best ideas together in the Java platform. Bram Cohen had deep knowledge of peer-to-peer protocols when he developed BitTorrent.

Just like Western science made a great leap once we stopped using the Aristotelian method, software engineering also benefited greatly from adopting a cycle of feedback.

Tuesday, January 20, 2009

A Memory Problem With Java IO

A coworker and I found something interesting about Java IO the other day.

We were getting an OutOfMemoryError when trying to write the contents of a big byte array to a file. We all hate OutOfMemoryErrors: you never know what really caused one, especially when it happens in a production environment with tens or hundreds of threads serving requests at the same time. The process of fixing it usually involves profiling the memory in search of leaks, which can be very time consuming.

This OutOfMemoryError was different, though. It happened inside FileOutputStream.write(), which is a native call. I happened to have my personal laptop at the office that day, where I keep the OpenJDK source code. I have NetBeans 6.5 set up with the OpenJDK projects, which makes navigating through its C++ code a breeze.

My discovery was interesting: when you write a byte array to a file, Java copies the contents of the byte array into a native array and passes that native array to the native IO function. If the array is 8 KB or smaller, it is copied into an area on the stack. If it is larger than 8 KB, the JVM allocates memory on the native heap and uses that area instead.

In our case, the byte array was around 5 MB. Because of some native libraries that we use, we have to live with 32-bit JVMs, and the heap size of our application servers is set to 1024 MB. Along with a 196 MB PermGen size, plus the memory used by those native libraries, that left very little room for native code. The JVM failed to allocate the 5 MB native array, probably because of heap fragmentation, and threw an OutOfMemoryError from the native side.

The longer term solution will be to switch to a 64-bit JVM, and the mid-term solution will be to review the memory configuration of our application servers. In the short term, however, the fix was very simple: instead of writing the big byte array in one go, all we had to do was write a loop that writes in chunks of 8 KB, and the problem was gone!
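
A minimal sketch of that workaround (not our production code; the 8 KB chunk size mirrors the JDK's internal stack buffer):
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ChunkedWriter {

    private static final int CHUNK_SIZE = 8 * 1024; // matches the JDK's stack buffer

    // Writes the array in 8 KB slices so each native call stays within the
    // stack buffer and no large native allocation is needed.
    public static void writeInChunks(OutputStream out, byte[] data) throws IOException {
        int offset = 0;
        while (offset < data.length) {
            int length = Math.min(CHUNK_SIZE, data.length - offset);
            out.write(data, offset, length);
            offset += length;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bigArray = new byte[5 * 1024 * 1024]; // ~5 MB, like in our case
        OutputStream out = new FileOutputStream("output.bin");
        try {
            writeInChunks(out, bigArray);
        } finally {
            out.close();
        }
    }
}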

Having the JDK source code at hand saved us a lot of time guessing and heading in the wrong direction in the dark.

Side note: Heap fragmentation is usually not a problem in Java, because the various JVM garbage collectors put a great deal of effort into compacting the heap (the most advanced GCs don't ensure complete compaction but they do their best). C doesn't have managed pointers, so heap fragmentation can become a real problem for C and C++ programs or libraries.

Monday, January 5, 2009

Busting java.lang.String.intern() Myths

(Or: String.intern() is dangerous unless you know what you're doing!)

Update: The subject discussed in this article was true up to Java 6. In Java 7, interned strings are no longer allocated in the permanent generation, but in the main Java Heap. I've decided to leave this article here for historical purposes, but keep in mind that the allocation of interned strings in the main heap makes the intern() method an appealing feature to prevent string explosion in the heap.

If you have ever peeked through the Javadoc or the source code of the Java String class, you probably noticed a mysterious native method called intern(). The Javadoc is very concise. It basically says that the method returns a representation of the String that is guaranteed to be unique throughout the JVM. If two String objects are interned, they can be safely compared with == instead of equals(). This description suggests two reasons to use intern(): comparison becomes faster, and there can be potential memory usage improvements, because you don't waste the heap on lots of equivalent strings.
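
A tiny illustration of that contract (string literals are interned automatically by the JVM, so the second comparison prints true):
public class InternContract {
    public static void main(String[] args) {
        String dynamic = new StringBuilder("foo").append("bar").toString();
        String literal = "foobar"; // string literals are interned automatically

        System.out.println(dynamic == literal);          // false: different objects
        System.out.println(dynamic.intern() == literal); // true: same pooled instance
    }
}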

Both of those reasons are closer to myth than reality. The performance myth doesn't cause any harm; it's just that the gain is not as big as one would think. But the memory usage myth is where the danger lies: by trying to improve memory usage, one can actually end up causing OutOfMemoryErrors in the application.

Let's look at those myths with more detail.

Myth 1: Comparing strings with == is much faster than with equals()

An industrious developer might think of interning strings for performance reasons: you call intern() once, even though it can be a costly operation, but after that you can always compare the strings with ==. What a performance improvement it must be!

I wrote a quick benchmark to compare both approaches. It turns out that comparing strings with an average length of 16 characters using equals() is only about 5 times slower than comparing with ==. Even though a 5-fold difference is still a large number, you may be surprised that the gap isn't bigger. There are two reasons for this. First, String.equals() compares the characters only as a last resort: it first compares the lengths of the strings, which are stored in a separate field, and only if the lengths are the same does it start comparing the characters, halting as soon as it finds the first non-matching character.

Another reason for the relatively small difference between the two approaches is that the HotSpot optimizer does a very good job of optimizing method calls, and String.equals() is a very good candidate for inlining, since it is a small method that belongs to a final class. That removes any overhead related to the method call.

Now, == provides a 5-fold improvement over equals(). But since string comparison usually represents only a small percentage of the total execution time of an application, the overall gain is much smaller than that, and the final gain is diluted to a few percent.
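
For reference, here is a minimal sketch of the kind of micro-benchmark I am describing (not the original code; absolute numbers depend on the JVM, and a serious benchmark needs warm-up and more care than this):
public class StringComparisonBenchmark {

    private static final int ITERATIONS = 50000000;

    public static void main(String[] args) {
        String base = "0123456789abcdef";  // 16 characters, as in the text
        String copy1 = new String(base);   // equal content, distinct objects
        String copy2 = new String(base);
        String interned1 = copy1.intern(); // both resolve to the same
        String interned2 = copy2.intern(); // pooled instance

        int hits = 0; // accumulate results so the loops aren't optimized away

        long start = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            if (interned1 == interned2) {
                hits++;
            }
        }
        long identityNanos = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            if (copy1.equals(copy2)) {
                hits++;
            }
        }
        long equalsNanos = System.nanoTime() - start;

        System.out.println("==       : " + identityNanos + " ns");
        System.out.println("equals() : " + equalsNanos + " ns");
        System.out.println("(ignore) : " + hits);
    }
}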

So Myth 1: busted! Yes, == is faster than String.equals(), but in general it isn't anywhere near the performance improvement it is cracked up to be.

Myth 2: String.intern() saves a lot of memory

This myth is where the danger lies. On one hand, it is true that you can remove String duplicates by interning them. The problem is that interned strings go to the permanent generation, which is an area of the JVM reserved for non-user objects, like classes, methods and other internal JVM objects. The size of this area is limited, and it is usually much smaller than the heap. Calling intern() on a String has the effect of moving it out of the heap and into the permanent generation, and you risk running out of PermGen space.

I wrote a small test program that confirms this (see below). The call to Thread.sleep(1000) is there so that you can watch the permanent generation grow in a profiler. You can check it yourself by running this program and then running jconsole, which is available in the JDK distribution. Go to jconsole's Memory tab and select the "Perm Gen" memory pool in the dropdown box. You will see the permanent generation grow steadily until the process terminates with a java.lang.OutOfMemoryError: PermGen space.
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) throws Exception {
        int steps = 1000;
        String base = getBaseString();

        // Keep strong references so the interned strings can't be collected.
        List<String> strings = new ArrayList<String>();
        int i = 0;
        while (true) {
            // Each iteration interns a new ~1000-character string.
            String str = (base + i).intern();
            strings.add(str);
            i++;
            if (i % steps == 0) {
                // Pause so the PermGen growth is easy to follow in jconsole.
                Thread.sleep(1000);
            }
        }
    }

    // Builds a 1000-character base string so each interned string is large.
    private static String getBaseString() {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            builder.append("a");
        }
        return builder.toString();
    }
}

So Myth 2: busted! String.intern() saves heap space, but at the expense of the more precious PermGen space.

Myth 3: interned strings stay in memory forever

This myth goes in the opposite direction of Myth 2. Some people believe that interned strings stay in memory until the JVM exits. That may have been true a long time ago, but today interned strings are garbage collected if there are no more references to them. Below is a slightly modified version of the program above that clears the references to the interned strings from time to time. If you follow the program's execution in jconsole, you will see the PermGen space usage go up and down as the garbage collector reclaims the memory used by the unreferenced interned strings.
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) throws Exception {
        int steps = 1000;
        String base = getBaseString();

        List<String> strings = new ArrayList<String>();
        int i = 0;
        while (true) {
            String str = (base + i).intern();
            strings.add(str);
            i++;
            if (i % steps == 0) {
                Thread.sleep(1000);
            }

            // Every 4000 strings, drop all references so the interned
            // strings become eligible for garbage collection.
            if (i % (steps * 4) == 0) {
                strings = new ArrayList<String>();
            }
        }
    }

    // Builds a 1000-character base string so each interned string is large.
    private static String getBaseString() {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            builder.append("a");
        }
        return builder.toString();
    }
}

Myth 3: busted! Interned strings are released if they are no longer referenced.

Note: when == is worth using over equals()

If you are doing heavy text processing you may want to intern strings. But in that case, you are probably better off using the approach I outlined here: Weak Object Pools With WeakHashMap. With this approach, you get the benefit of having unique Strings without the penalty of using up PermGen space.
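
To give an idea, a minimal sketch of that kind of weak pool might look like this (not the exact code from that article):
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// A minimal weak string pool: the canonical strings live in the regular heap
// and disappear as soon as nobody else references them, so nothing piles up
// in the PermGen.
public class WeakStringPool {

    private final Map<String, WeakReference<String>> pool =
            new WeakHashMap<String, WeakReference<String>>();

    public synchronized String canonicalize(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref == null) ? null : ref.get();
        if (canonical == null) {
            // First time we see this value: it becomes the canonical instance.
            canonical = s;
            pool.put(s, new WeakReference<String>(s));
        }
        return canonical;
    }
}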

Conclusion: always know what you are doing

As I said in the subtitle of this article, String.intern() is dangerous if you don't know what you are doing. Now you know the risks of using String.intern(), and you can make a more informed decision about whether or not to use it.

More information

  • The OpenJDK source code is the JVM hacker's heaven. It is not easy reading, though, and you need to know C++ to take advantage of what it has to offer.
  • Presenting the Permanent Generation is a very good article about what the PermGen is and what goes inside it.