In-Memory Map Compression?

Suppose a Java app retains a huge HashMap in memory. The map keys are strings, and its values are simple classes holding various display attributes such as label, tooltip, preferred width, and other display attributes.

While the in-memory storage is fast, the maps are very large and consume a lot of RAM. Most of the data is text, however, and there are many similar, although not identical, tooltips and labels. So I’m wondering if anybody ever implemented in-memory map compression? Kind of like an in-memory ZIP archive?

Maybe this isn’t specific to maps, but rather a general-purpose in-memory Java object compression mechanism? It seems like any time you have a cache containing tens of thousands of objects, perhaps the cache could compress the data internally?

It’d probably be easier to just move this to a database, and then perhaps cache frequently accessed objects in memory, rather than worry about compression. Anyway, it’s just a random (geeky) thought.


12 Responses to “In-Memory Map Compression?”

Sam.Halliday Says:

Damn you! I’ve had this post lying around my drafts folder for months! ;-)

I was recently stung by a TreeSet of Strings… for a million entries I was getting about 40MB of *overhead*. (40 bytes per String and another 32 per Entry), see http://forum.java.sun.com/thread.jspa?threadID=5214543 for more details. And that’s before you even count the fact that String is 2 bytes long even if all the data is ASCII. The fact of the matter is that when you’re dealing with millions of objects, it really matters how much state is in each of those objects… even the overhead of Object itself starts to count!

For the record, my solution involved creating a very large String, and creating a much smaller index into it. Much smaller footprint, but much more complex code. If you have the RAM to spare… I’d suggest just living with it. In my case, it really had to fit in under 30MB.

Sam.Halliday Says:

sorry, that should have been ~80MB of overhead, not 40MB. And the data itself was only 20MB, so it was a real performance killer. Even UTF-8 -> ASCII compression techniques wouldn’t have worked there, the “bottleneck” was in the references, not the target objects. I can’t even begin to imagine what a transparent compression technique would be to deal with the structure itself.

Alex Miler Says:

Several things spring to mind in no organized fashion:

1) If many strings are identical, you could intern() them, although sounds like they may already be literal/constant strings and auto-interned

2) Or if many strings are identical *pattern* with inserted params, you could store them as pattern+values instead of with the values replaced and then your patterns could be interned.

3) You could use a cache that spills to disk (ehcache is quite good)

4) JSR 203 for Java 7 is NIO 2 (the new new IO :) and includes an overhaul of file system access. Part of the design is a filesystem abstraction, that should let you treat a zip archive file system just like your disk file system (and potentially just like an in-memory file system). From talking to those guys, memory-based file systems were definitely a consideration. I’m not sure that solves your problem at all, just an interesting related thought. :)

5) You could implement Map, proxy another Map and compress/decompress big strings in the proxy. That seems relatively straightforward.

6) Terracotta does big string compression like this automatically to avoid passing humongous strings around the system.

7) If your big strings happen to be XML, there are several optimized object forms out there for XML that are much more efficient than a string. Saxon has a TinyTree structure that’s pretty great and I believe Nux has some good stuff too. Plus there are some binary XML libs but I haven’t worked with those.

afsina Says:

i think an embedded database may be a solution for this, they have cached table support. H2 is a good choice IMO.
Another alternative may be db4o , it is very easy to use and pretty fast. it uses disk, i don’t know the cache options.
Or as mentioned earlier, Ehcache may be an alternative too.

Alex Miler Says:

Embedded db may be an overkill but if not, H2 and db40 are excellent options, although I’d also throw Derby and Berkeley DB in the mix as well.

womble Says:

20MB, whats the problem? just hash it, each entry uses a pointer - 4 bytes. 10,000 entries = 40K, done, for multiple entries add a linked list to the hash entry. Oh it’s java tee hee, lots of luck.

afsina Says:

womble: err.. ok you hashed the data cool. where are you storing the “actual” data?

womble Says:

a bunch of ways, offset into a memory mapped file, malloc it in, and so on. The problem was the overhead introduced by the java objects and hashmap? The usual way to do this in a non GC language is create a proxy object for each of your different types and create and destroy them on the fly (flyweight pattern I think it’s called), you could probably do this in java some how - not my area.

cheers

afisna Says:

womble:
it is too bold to talk about something related with java yet not use or know anything about it. i very much doubt that the overhead is significant using JDK maps, i would urge you to check the HashMap implementation in JDK or NIO classes for memory mapped files. Plus, you can go to the way you said by writing a custom map implementation using primitives, i am sure it will be easier than C-C++, but i doubt it worth the pain anyway.

I think what you are referring to is a property of using the Flyweight pattern in that collapsing/sharing references reduces memory footprints. I have done something similar when interacting with JDBC and RMI before. Say I’m joining two tables together, and one of those tables has largely similar data, and the other is very unique. For example maybe I’m joining to a types table or something like a tags table. Lots of things are tagging with the same string so why not store that tag just once. So in order to collapse that memory I pass all tags through a HashMap. Something like:

public class Flyweight implements Map {
private Map map = new HashMap();

public Object collapse( Object key ) {
if( !map.containsKey( key ) ) {
map.put( key, key );
}
return map.get( key );
}
}

Flyweight flyweight = new FlyWeight();
obj.setTag( flyweight.collapse( resultSet.getString( i ) );

That makes sure I get collapse all references to similar Strings down to one actual object, and all references point at that shared object. Using a HashSet might be better. Great for immutable data like String. We did this to compress the amount of data being held in memory, but also serialized the resulting structures over RMI. Java serialization is awesome in this regard because it will reconstruct the memory graph verbatim on the client side so the compress transports across the wire. We cut memory usage by 75% in some cases using this technique. It doesn’t have to be just strings though. String.intern() is a highly specialized version of this, but you have to be very careful with it because those things that are intern()’ed last the lifetime of the program.

Charlie

ashish Says:

1.
I would suggest to look into jms queue persistent mechanism. They use pretty fast file based persistence for fail-over.

2.
if u getting data from DB, fine tuning hibernate cache & lazy loading might work

3. if data source is different from DB, then persisting to file might be fastest. (compare to DB)

Pi Says:

There is a problem here. HashMap is usually used for accessing random keys whereas partial decompression is often done in blocks which is good for sequencial access. In terms of performance, the compressed map thing might not be sufficiently efficient.

Leave a Reply