Tuesday, September 23, 2008

Weak Object Pools with WeakHashMap

In a previous article, I wrote about what the WeakHashMap is not good for. Now, to do justice to this very useful but misunderstood class, I am going to devote an article to one of its good applications. I will explain how to use a WeakHashMap to implement an object pool that promotes the reuse of instances of immutable objects, while at the same time avoiding memory leaks.

Reviewing the WeakHashMap

The WeakHashMap is a Java collection that uses the WeakReference class to hold its keys. As we've already seen, weak references to an object are cleared by the garbage collector as soon as there are no strong or soft references to the same object. The result is that entries only remain in the WeakHashMap as long as there are references to the map keys lying around your JVM.

Here is an example:
import java.util.WeakHashMap;

public class WeakHashMapSample1 {
    public static void main(String[] args) {
        WeakHashMap<String, String> weakHashMap = new WeakHashMap<String, String>();
        // Create a key for the map, but keep the strong reference
        String keyStrongReference = new String("key");
        weakHashMap.put(keyStrongReference, "value");
        // Run the GC and check if the key is still there.
        System.gc();
        System.out.println(weakHashMap.get("key"));
        // Now, null-out the strong reference and try again.
        keyStrongReference = null;
        System.gc();
        System.out.println(weakHashMap.get("key"));
    }
}

The code above prints:
value
null

What happened here? The first time we called System.gc(), there was still a strong reference to the key in the keyStrongReference variable, so the map key was not cleared by the garbage collector. In the following line, we got rid of the strong reference to the key and tried again. This time, the call to weakHashMap.get("key") returned null. (Keep in mind that System.gc() is only a hint to the JVM, so the exact moment the entry is cleared is not guaranteed.)

Implementing a Weak Object Pool with a WeakHashMap

If you make heavy use of small immutable classes, like String and the primitive wrapper classes like java.lang.Integer, you can take advantage of the WeakHashMap behavior to share instances and reduce memory usage, while at the same time not having to worry about objects lingering in memory after they are no longer used. The real benefit depends on how many instances you create and how the values are distributed, but you can potentially reduce the number of objects created by several orders of magnitude.

Here is an example of a weak object pool:

import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

/**
 * Oversimplistic implementation of an object pool
 */
public class WeakObjectPool {
    // Map where the key is an object, and the value is a weak reference
    // to the same object.  We use the key to do the lookup, and the value
    // to actually return the object when it is found.
    private Map<Object, WeakReference<Object>> map =
            new WeakHashMap<Object, WeakReference<Object>>();

    public Object replace(Object object) {
        WeakReference<Object> reference = map.get(object);
        if (reference != null) {
            Object result = reference.get();
            // Another null check, since the GC may have kicked in between the
            // two lines above.
            if (result != null) {
                return result;
            }
        }
        // If we got here it is because the map doesn't have the key, add it.
        map.put(object, new WeakReference<Object>(object));
        return object;
    }
}

Now, this class can be used like this:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

class ObjectPoolClient {
    private static WeakObjectPool objectPool = new WeakObjectPool();

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new FileReader("input.csv"));
        List<String[]> parsedLines = new ArrayList<String[]>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] elements = line.split(",");
            for (int i = 0; i < elements.length; i++) {
                // replace the string read from the file with the pool instance
                elements[i] = (String) objectPool.replace(elements[i]);
            }
            parsedLines.add(elements);
        }
        reader.close();

        // Cool, we saved a lot of memory by reusing the repeated strings!
        doSomethingInteresting(parsedLines);
        // Now, we get rid of the references and soon the garbage collector
        // will reclaim the memory
        parsedLines = null;
        doMoreInterestingStuff();
    }
}

Assuming the input file contains lots of repeated values, we've been able to save a lot of heap space by not having the same string repeated over and over again in memory. Also, we get the added benefit of releasing the memory when the strings are no longer needed.

Weak Object Pool in conjunction with the Flyweight pattern

The Flyweight pattern is a perfect match for the Weak Object Pool. The assumption behind this pattern is that the flyweight instances are shared to reduce memory consumption. The flyweight factory could use a Weak Object Pool to store the flyweights. This way, once a given flyweight is no longer in use, it will be released from the factory's storage and its memory reclaimed.
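As a sketch of how such a factory could look (the Glyph class, its fields, and the of() method are hypothetical, not from any particular library), here is a flyweight factory that stores its instances weakly:

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical immutable flyweight: a character glyph in a given font.
final class Glyph {
    private final char character;
    private final String font;

    private Glyph(char character, String font) {
        this.character = character;
        this.font = font;
    }

    // Canonicalizing factory: equal glyphs collapse to one shared instance,
    // and unused glyphs are reclaimed because the map holds them weakly.
    private static final Map<Glyph, WeakReference<Glyph>> pool =
            new WeakHashMap<Glyph, WeakReference<Glyph>>();

    public static synchronized Glyph of(char character, String font) {
        Glyph candidate = new Glyph(character, font);
        WeakReference<Glyph> ref = pool.get(candidate);
        if (ref != null) {
            Glyph cached = ref.get();
            if (cached != null) {
                return cached;
            }
        }
        pool.put(candidate, new WeakReference<Glyph>(candidate));
        return candidate;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Glyph)) return false;
        Glyph g = (Glyph) o;
        return character == g.character && font.equals(g.font);
    }

    @Override public int hashCode() {
        return 31 * character + font.hashCode();
    }
}
```

Client code just calls Glyph.of('a', "Serif") and never sees the sharing; when the last client drops its glyph, the corresponding pool entry goes with it.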

Words of caution

Don't go using Weak Object Pools everywhere you instantiate immutable classes. It is only an advantage when there is a lot of repetition of the values. If the values are more randomly distributed, you will be better off not using this pattern, because it incurs a small memory overhead for the internal map structures. If you apply the Weak Object Pool pattern in the wrong situation, you may end up with worse performance!

Also, it is very important to use this pattern only to store immutable classes. If you use it with a non-immutable class, you can find yourself with bugs that are very difficult to reproduce and fix. Those bugs happen when an instance stored in the Weak Object Pool is shared by two completely unrelated clients and one of them modifies it: the other client will see the modified value, and it will be very difficult to trace the modification back to its origin.
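To make the danger concrete, here is a contrived sketch of the aliasing bug (the MutablePoint class is hypothetical, and the inline Pool is a minimal copy of the weak pool described above):

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

public class MutablePoolBug {
    // A hypothetical MUTABLE class -- exactly what should NOT be pooled.
    static class MutablePoint {
        int x, y;
        MutablePoint(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            return o instanceof MutablePoint
                    && ((MutablePoint) o).x == x && ((MutablePoint) o).y == y;
        }
        @Override public int hashCode() { return 31 * x + y; }
    }

    // Minimal inline copy of the weak object pool.
    static class Pool {
        private final Map<Object, WeakReference<Object>> map =
                new WeakHashMap<Object, WeakReference<Object>>();
        Object replace(Object object) {
            WeakReference<Object> ref = map.get(object);
            if (ref != null) {
                Object result = ref.get();
                if (result != null) return result;
            }
            map.put(object, new WeakReference<Object>(object));
            return object;
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        MutablePoint clientA = (MutablePoint) pool.replace(new MutablePoint(1, 2));
        MutablePoint clientB = (MutablePoint) pool.replace(new MutablePoint(1, 2));
        // Both clients unknowingly share the SAME instance.
        clientA.x = 99;
        // Client B silently sees client A's change (and, worse, the mutated
        // key now has a different hashCode than the one it was stored under).
        System.out.println(clientB.x); // prints 99
    }
}
```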

Sunday, September 21, 2008

Martin Fowler got it wrong?

Martin Fowler recently posted to his blog a comment about software requirements. He starts by quoting the opening paragraph of the "Mastering the Requirements Process" book:

Requirements are the things that you should discover before starting to build your product. Discovering the requirements during construction, or worse, when your client starts using your product, is so expensive and so inefficient, that we will assume that no right-thinking person would do it, and will not mention it again.


Then he goes on to criticize the statement above, by comparing it to the agile processes, where you have continuous discovery of the requirements. Here is what he says:

Agile methods violate this underlying assumption by intending to discover the 'requirements' during construction and after delivery.


I don't quite agree with Martin Fowler here. Agile processes are inherently iterative, but even in an iterative process the requirements for a given iteration are supposed to be known before that iteration starts. At the beginning of a Scrum iteration, for instance, the team is supposed to know exactly what it has to build, as defined in the Sprint Backlog. An iteration doesn't start if the team doesn't have the scope well defined. The opposite would be to sit down in front of a computer one morning to begin writing the revenue recognition module without any idea of how the revenue recognition rules are supposed to work, and to start asking around to find out.

The even worse thing mentioned in the quote above is discovering the requirements when the system is already in production. That is equivalent to writing the revenue recognition module based on your own assumptions about the business rules, and then finding out in production that the rules you implemented are completely wrong.

What Agile methods do have is the ability to adapt to change. Requirements change because the market or the business changes, and they also change because the people involved gain a better understanding of what the requirements are. The assumption is that you will sometimes get the requirements wrong, and this is very often the case, even though you still try to get them right the first time. When you do get them wrong, Agile processes have the means to adapt to the change.

Saturday, September 20, 2008

WeakHashMap is not a cache! Understanding WeakReference and SoftReference

If you ever found yourself in the need to implement a simple caching functionality in your Java programs, chances are that you at least considered using the WeakHashMap class as the cache.

It turns out that the WeakHashMap makes for a terrible cache, for two reasons. The first reason is that it uses weak references as the underlying memory management mechanism. The second reason is that the weak references are used for the keys and not for the values, which is what you would want.

Reference classes and reachability

To understand what the WeakHashMap is good for, we need to understand the WeakReference and SoftReference classes and the difference between them. Both extend the Reference class, which resides, along with its children, in the java.lang.ref package. The Reference classes are used to represent object references that are weaker than regular Java references, which are called strong references. Objects that can be reached by a chain of only strong references never get garbage collected. The weaker the references to an object, the more likely the object will be reclaimed by the garbage collector.

The strongest type of reference is the strong reference, like when you declare String name = "John Doe";. The name variable is a strong reference to the "John Doe" String object. SoftReferences are weaker than strong references, and WeakReferences are weaker than SoftReferences. There is also an even weaker type of reference, the PhantomReference, which I'm not going to talk about here.
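Here is a small sketch showing these strengths side by side; note that System.gc() is only a hint, so the results after the collection are typical rather than guaranteed:

```java
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

public class ReferenceStrengths {
    public static void main(String[] args) {
        String softTarget = new String("softly held");
        String weakTarget = new String("weakly held");

        SoftReference<String> soft = new SoftReference<String>(softTarget);
        WeakReference<String> weak = new WeakReference<String>(weakTarget);

        // While the strong references exist, both get() calls return the objects.
        System.out.println(soft.get()); // softly held
        System.out.println(weak.get()); // weakly held

        // Drop the strong references.
        softTarget = null;
        weakTarget = null;
        System.gc(); // only a hint; collection is not guaranteed

        // After a collection, weak.get() typically returns null, while
        // soft.get() usually still returns the object until memory runs low.
        System.out.println(weak.get());
        System.out.println(soft.get());
    }
}
```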

The types of references involved in the reference chain that starts from a local or a static variable and ends in an object determine the object's reachability. The Java API explains the different categories of object reachability this way:

  • An object is strongly reachable if it can be reached by some thread without traversing any reference objects. A newly-created object is strongly reachable by the thread that created it.

  • An object is softly reachable if it is not strongly reachable but can be reached by traversing a soft reference.

  • An object is weakly reachable if it is neither strongly nor softly reachable but can be reached by traversing a weak reference. When the weak references to a weakly-reachable object are cleared, the object becomes eligible for finalization.

  • An object is phantom reachable if it is neither strongly, softly, nor weakly reachable, it has been finalized, and some phantom reference refers to it.

  • Finally, an object is unreachable, and therefore eligible for reclamation, when it is not reachable in any of the above ways.


If the garbage collector determines that an object is strongly reachable, it will not reclaim the object. This is what we would expect. Nobody wants to have an object garbage collected when it can still be reached by a chain of strong references. Now, here is the important point that is not written in the explanation above: if the garbage collector determines that an object is softly reachable, it may atomically clear all soft references to the object, when it finds that memory is running low or at its own discretion. But if the garbage collector determines that an object is weakly reachable, it will atomically clear all weak references to the object. This is the major difference between weak and soft references and the reason that makes the WeakReference ill-suited for caching.

What is the WeakHashMap good for?

Now it is easy to understand why the WeakHashMap doesn't work for caching. First of all, it wouldn't work anyway because it uses weak references for the keys and not for the map values. But in addition to that, the garbage collector aggressively reclaims memory that is referenced only by weak references. This means that once you lose the last strong reference to an object that is working as a key in a WeakHashMap, the garbage collector will soon reclaim that map entry.

If the WeakHashMap is no good for caching, then what is it good for? It is good for implementing canonicalizing mappings. Let's say you want to associate some extra information with an object that you have a strong reference to. You put an entry in a WeakHashMap with the object as the key, and the extra information as the map value. Then, as long as you keep a strong reference to the object, you will be able to check the map to retrieve the extra information. And once you release the object, the map entry will be cleared and the memory used by the extra information will be released.
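For example, assuming a hypothetical Session class that we want to annotate without modifying it:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class ExtraInfoExample {
    // Hypothetical class we want to attach extra information to.
    static class Session {
    }

    public static void main(String[] args) {
        Map<Session, String> extraInfo = new WeakHashMap<Session, String>();

        Session session = new Session();
        extraInfo.put(session, "logged in from 10.0.0.1");

        // As long as we hold a strong reference to the session,
        // the associated information can be looked up...
        System.out.println(extraInfo.get(session)); // logged in from 10.0.0.1

        // ...and once 'session' becomes unreachable, the entry (and the
        // memory used by the value) is eligible to be cleared automatically.
        session = null;
    }
}
```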

Can I just copy and paste the WeakHashMap class to write my cache?

No. Please, don't copy and paste the WeakHashMap source code replacing WeakReference with SoftReference. This won't be effective. To understand why, look at this example:
SoftHashMap cache = new SoftHashMap(); // A copy and paste from WeakHashMap
SomeExpensiveClass myReference1 = .... // get expensive class instance
cache.put(new Long(10), myReference1); // put the expensive object in the cache
... // do some stuff, but keep the myReference1 variable around!
SomeExpensiveClass myReference2 = cache.get(new Long(10)); // query the cache
if (myReference2 == null) {
    // Uh-oh, the cache got rid of the object, even though I
    // still had a reference to it in the myReference1 variable!
}

You would expect the cache to keep the reference to the object, since the myReference1 variable was still around. This may happen, but it may also happen that the cache will have been cleared. Why is this? Because at the "do some stuff" block, the garbage collector may have kicked in, noticed that the map's key was softly reachable, and chosen to garbage collect it. Remember, the WeakHashMap uses WeakReference for the keys, so your copy-paste implementation would use SoftReference for the keys. This reduces the effectiveness of the cache because you would expect the object to remain in the cache since the strong reference to it is still around. But the problem is that there is no strong reference to the key in the example above.

So how the hell do I implement a cache in Java?

My suggestion is to use one of the freely available cache implementations, like JCS, OSCache and others. Those libraries provide better memory management (with LRU and FIFO eviction policies, for instance), disk overflow, data expiration and many other optional advanced features.

If you still want to implement a cache class that takes advantage of SoftReferences yourself, implement your own Map class that extends AbstractMap and internally wraps the map values in SoftReferences. Then you need a mechanism to remove the stale entries from the map. For this, you may want to use a ReferenceQueue: poll it for collected references and remove the corresponding entry from the map for each one. Make sure you do this in an efficient way, or you may end up with nested loops, O(n²) operations, and a cache whose performance will suck.
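As a minimal sketch of this idea (wrapping a plain HashMap internally rather than extending AbstractMap, and with no synchronization), the cache and its stale-entry cleanup could look like this:

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCache<K, V> {
    // Soft reference that remembers its key, so that a stale entry can be
    // removed from the map in O(1) instead of scanning all entries.
    private static class SoftValue<K, V> extends SoftReference<V> {
        final K key;
        SoftValue(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, SoftValue<K, V>> map = new HashMap<K, SoftValue<K, V>>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<V>();

    public V get(K key) {
        expungeStaleEntries();
        SoftValue<K, V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }

    public void put(K key, V value) {
        expungeStaleEntries();
        map.put(key, new SoftValue<K, V>(key, value, queue));
    }

    // Poll the queue for values the GC has cleared and drop their entries.
    @SuppressWarnings("unchecked")
    private void expungeStaleEntries() {
        SoftValue<K, V> stale;
        while ((stale = (SoftValue<K, V>) queue.poll()) != null) {
            map.remove(stale.key);
        }
    }
}
```

The key design point is the SoftValue subclass: because each reference carries its own map key, the cleanup loop stays linear in the number of collected entries, avoiding the O(n²) trap mentioned above.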

More Information

API documentation of the java.lang.ref package for an explanation of the Reference classes and object reachability.
API documentation of the WeakHashMap class for more information on the limitations and applications of the WeakHashMap class.

Sunday, September 7, 2008

Work environments and the Flow

In this post I will talk about the Flow state, and why it is important to provide developers with environments that enable them to attain that state. First, what is Flow?

"Flow" is the name given to a mental state of deep concentration, when you can spend literally hours on a problem without getting easily distracted. In this state people can work on harder problems than they can usually work on, because they can keep track of more problem variables and levels of abstraction. When you are in the Flow it seems that the time flies by, but in the end you are still amazed at how much work you accomplished during that period. Flow is the state you are usually in when you are alone at the office working late hours without anything to distract you. During those late hours, you are usually able to get more work done than during the whole day.

It is very difficult to get into the Flow, but it is very easy to lose it. It usually takes 10 to 15 minutes to get into a Flow state, and during this period any small distraction breaks your concentration and you have to start over. Once you are in the Flow things are better, and you only lose it if there is a distraction directed specifically at you, like a coworker calling your name to ask you a question, for instance. Every time you wake up from the Flow state, it takes you 10 to 15 minutes to get back to it again.

Flow is THE productivity silver bullet for software development. It enables developers to work faster and produce better quality code. When people write software outside of the Flow, they usually introduce many more bugs because of the lack of concentration.

Software companies should provide their developers with work environments that make the Flow possible. The problem is that it is very difficult. Here is a quick list of things a developer needs to get into, and stay in, the Flow.

  1. Quiet work environment. No one will get into the Flow if there are people talking around them all day long. Ideally, developers should have individual rooms with doors, or at most small rooms where two or three people work together. This goes against some Agile evangelists who encourage open spaces, under the theory that they promote communication. But people who encourage open spaces clearly don't do programming as their main activity.

  2. Kill other distractions. Things like IM software and mail popups can kill the concentration very easily. So developers should be allowed, and actually encouraged, to close their IM and email software whenever they like.

  3. Quick compile cycles. Having a quiet environment is no good if it takes you too long to compile the software. And too long in this case is anything longer than a few seconds. If this happens, the developer will most likely switch to the web browser to check the latest news, or switch to the email client to check the last messages, and the hard earned Flow goes down the drain. Using Test Driven Development helps you to stay in the flow, because it is quicker to compile the code and run the tests as opposed to having to redeploy and restart the application every time.


Having said all that, most software companies don't do any of these things. It is very hard to convince upper management to give individual rooms to developers. It is also very hard to convince them that too much information at once is a bad thing and that it is OK to turn off the internal IM and email software.

But even if you don't have the ideal work environment, there are still things you can do by yourself. You can avoid asking other people silly questions that you can figure out on your own with a little research. Otherwise, to save yourself a couple of minutes, you make the other person lose 10 minutes getting back into the Flow.

You can also take actions to isolate yourself from the surrounding distractions. The approach preferred by 9 in 10 developers is to listen to music on headphones. I don't like to listen to music all day, so I sometimes listen to pink noise. You can also try noise-canceling headphones.

Another suggestion is to move your desk so that you are not looking at the aisle. Try facing a window, a wall, or a corner of the office that has less movement. This avoids visual distractions.

Neal Ford suggests in his book The Productive Programmer that the team institute a Quiet Time, a few hours a day when people are left alone to work and chatter in the office is avoided. He says that in the team where it was instituted, people came to regard those hours as the most anticipated period of the day. In order to work, though, this has to be agreed upon by the team members so that everyone respects it.

Getting into the Flow is very difficult, and most managers don't value it, since managers are expected to multitask and so the Flow is not important to them. Setting up the environment so that the team can achieve the Flow state can be a daunting task, but if it is done, the productivity gains will be worth the effort.

More information:
Peopleware is a must read book for any software development manager. It discusses the concept of Flow and also good work environments for developers.