I spend a lot of time talking to developers of varying skill levels, and one question that comes up again and again concerns the serialization of objects. When Java was first released, your options were pretty simple: use the binary serialization provided by the JDK or roll your own mechanism. While the choice was simple, neither solution was all that simple in practice. After seeing this question come up on several occasions, I decided to write an article on the basic use of XML serialization in the hope that it would give some of these developers at least a basic understanding of how it can be used.
Problems with Serialization
The "standard" JDK serialization produces a binary file with various encodings in the file, including a version flag. This makes it difficult to deserialize that object in another VM. If the VM versions differ, the VM may refuse entirely to deserialize the object, and the difference can be as small as a minor version release. The binary format is useful from a performance standpoint, but it makes debugging more difficult and makes manual migration of serialized data next to impossible without intimate knowledge of the format. It also makes it difficult to serialize data for use in an external tool. If that tool is Java-based, the task is manageable. A non-Java tool, however, requires some form of conversion to a usable format, and that solution has its own obvious problems: an extra step in the chain, potential conversion errors, maintenance issues, etc. Despite these problems, the JDK evolved for years without really addressing the issue until 1.4 introduced XML serialization.
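To make the opacity of the binary format concrete, here is a minimal sketch that serializes a small object with the standard ObjectOutputStream and prints the first bytes of the result. The Note class and the helper names are made up for this illustration. The stream starts with a magic number and a stream version rather than anything a person can read.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class BinaryHeaderDemo {
    // Hypothetical payload class, used only for this illustration.
    static class Note implements Serializable {
        private static final long serialVersionUID = 1L;
        String text = "hello";
    }

    // Serializes any Serializable object into an in-memory byte array.
    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = serialize(new Note());
        // The first four bytes are the stream magic (AC ED) and stream version (00 05).
        System.out.printf("%02X %02X %02X %02X ... (%d bytes total)%n",
                data[0] & 0xFF, data[1] & 0xFF, data[2] & 0xFF, data[3] & 0xFF, data.length);
    }
}
```

Everything after that header is a compact encoding of class descriptors and field values, which is why inspecting or hand-editing such a file is impractical.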
With XML serialization, the JDK finally provided a portable mechanism for serializing data that could survive JVM upgrades and be readily used by external tools. There are several competing serialization tools, however, that are, for many, superior choices. This article will compare and contrast two of those libraries, JSX and XStream, with the standard JDK option. I know that JSX has been discontinued in favor of JSX2, but JSX2 is only available commercially and I have fairly extensive experience with the original JSX. I realize it seems a bit unfair to compare an older version, but since it's freely available it's more likely to be used in open source projects. (Update: the author has taken down the download link for JSX, so neither is available at this point.) JSX2 is largely source compatible (only the XML output is much, much cleaner), so this should still be fairly relevant should you decide to purchase JSX2. (The author of JSX2 has informed me that he is phasing out JSX2 as a product, so the point is especially moot.) So with that out of the way, let's get to it.
For the purposes of this article, I have written two fairly simple objects that will hopefully be sufficient to demonstrate the basic principles. I have a Person object that has a list of Items. To test all this, I've written a simple program to run through each approach. As you can see, there is not much difference in terms of effort to use any of these options.
XMLEncoder takes one extra step to close the encoder, but that's the only major difference you will see in your code. Deserialization is also really simple with all three approaches. The differences you'll find are all in the XML, and those differences can be quite dramatic.
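As a rough sketch of what the XMLEncoder leg of such a test program might look like, the following round-trips a Person with a list of Items through XML in memory. The class shapes, property names, and helper methods are assumptions made for this example, not the original listing.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class XmlEncoderRoundTrip {
    // Assumed bean shapes: public classes, no-arg constructors, getter/setter pairs.
    public static class Item {
        private String name;
        public Item() {}
        public Item(String name) { this.name = name; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static class Person {
        private String name;
        private List<Item> items = new ArrayList<>();
        public Person() {}
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public List<Item> getItems() { return items; }
        public void setItems(List<Item> items) { this.items = items; }
    }

    static String toXml(Object bean) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(bytes);
        encoder.writeObject(bean);
        // The extra step: close() flushes the output and writes the closing </java> tag.
        encoder.close();
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    static Object fromXml(String xml) {
        ByteArrayInputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        // Passing the class loader explicitly avoids relying on the thread context loader.
        try (XMLDecoder decoder = new XMLDecoder(in, null, null,
                XmlEncoderRoundTrip.class.getClassLoader())) {
            return decoder.readObject();
        }
    }

    public static void main(String[] args) {
        Person person = new Person();
        person.setName("Ada");
        person.getItems().add(new Item("book"));
        person.getItems().add(new Item("pen"));

        String xml = toXml(person);
        System.out.println(xml);

        Person copy = (Person) fromXml(xml);
        System.out.println(copy.getName() + " has " + copy.getItems().size() + " items");
    }
}
```

Forgetting the close() call tends to leave a truncated document behind, which is why it matters more than its size in the code suggests.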
Let's take a look at XMLEncoder first since it comes with the JDK and is the most accessible. As you can see from the generated XML, the output is quite verbose. If you look closely, you will see that the XML is a step-by-step set of instructions describing which methods the JVM should call to rebuild this object in memory. If you look even more closely, you will notice that not all of our data is there. The created field defined in Person is missing. This is because there are no getter and setter methods for created. Without these methods, XMLEncoder cannot serialize and deserialize this field. Granted, it is easy enough to add these methods to make XMLEncoder work, but having internal fields is very common and it is not always wise or desirable to expose these fields. That aside, it hardly seems reasonable to expect a developer to change potentially hundreds of classes to accommodate a serialization scheme given the range of options available.
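The missing-field behavior is easy to reproduce. The sketch below uses an assumed Person with a created field that has no getter/setter pair; the stampCreated and createdMillis methods are hypothetical helpers, deliberately named so that they do not look like bean accessors.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Date;

class HiddenFieldDemo {
    public static class Person {
        private String name;
        private Date created = new Date(); // internal field: no getCreated/setCreated

        public Person() {}
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }

        // Hypothetical helpers; their names do not follow the bean property pattern.
        public void stampCreated(long millis) { this.created = new Date(millis); }
        public long createdMillis() { return created.getTime(); }
    }

    static String toXml(Person person) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(bytes)) {
            encoder.writeObject(person);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    static Person fromXml(String xml) {
        ByteArrayInputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        try (XMLDecoder decoder = new XMLDecoder(in, null, null,
                HiddenFieldDemo.class.getClassLoader())) {
            return (Person) decoder.readObject();
        }
    }

    public static void main(String[] args) {
        Person original = new Person();
        original.setName("Ada");
        original.stampCreated(0L); // pretend the object was created at the epoch

        String xml = toXml(original);
        Person copy = fromXml(xml);

        System.out.println("XML mentions created: " + xml.contains("created"));
        System.out.println("original created = " + original.createdMillis());
        // The copy's value comes from its constructor at decode time, not from the XML.
        System.out.println("copy created     = " + copy.createdMillis());
    }
}
```

The name property survives the trip, while the copy silently receives a fresh created value, which is arguably worse than a loud failure.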
I have used JSX on previous jobs to do some pretty intense logging. Basically, our setup used JSX to serialize events so that they could be played back at a later date. This meant that the process had to be fast and rock solid. To its credit, JSX could serialize anything we threw at it and was very responsive. The main complaint I had about JSX was the XML output. The XML is fairly verbose, as you can see. Typically, this is not a problem, as most of the time tools will be processing this data and you will not have much need to deal with it manually. In our experience, however, there is definitely a debugging burden for complex object structures. Another downside is that dates get serialized as milliseconds since the epoch. I am sure this makes them faster to deserialize, but it makes visual inspection difficult. Still, overall, it is very easy to use and not overly difficult to read.
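When inspecting such output by hand, a throwaway helper along these lines can translate an epoch-millisecond value into something readable. The sample value is made up for illustration and does not come from actual JSX output.

```java
import java.time.Instant;

class EpochMillisDemo {
    // Formats milliseconds since the epoch as an ISO-8601 timestamp in UTC.
    static String readable(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        long sample = 1_000_000_000_000L; // made-up value for illustration
        System.out.println(sample + " -> " + readable(sample)); // 2001-09-09T01:46:40Z
    }
}
```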
XStream is a new library to me. I have seen it come up once or twice in discussion but have never used it in practice. XStream is, in part, one of the reasons for this article. Like the other two mechanisms, XStream is very simple to use. Probably the biggest differentiator for me with XStream is the XML it generates. As you can see with the date, it is written in a human-readable form as well. This makes auditing the XML so much easier than, say, with JSX because you do not have to convert milliseconds since the epoch to some understandable form. It's a nit, but I like how there is one class to deal with for both serialization and deserialization. The API is fairly straightforward, without any hoops to jump through just to get started. It also provides some nice integration with standard serialization mechanisms. One can create a standard ObjectOutputStream and drop it into existing code without the system being aware that it is using XStream's ObjectOutputStream. Obviously, this can be incredibly useful when migrating existing serialization code without needing to touch too much code. Couple this with an IoC framework like Spring and it might not even need a recompile.
The heart of any XML serializer is the conversion mechanism that moves data to and from the XML files. Of the three libraries mentioned here, only XMLEncoder and XStream provide obvious mechanisms for overriding the built-in converters. XMLEncoder has PersistenceDelegates that can be assigned to alter the generated XML, but no obvious mechanism exists for the trip back. XStream, on the other hand, has a very nice list of defined converters to serialize a variety of data types. These are the default converters that XStream will use. Registering a custom converter is very simple, though. The following code snippet demonstrates how easy it is to do:
Converters have a simple interface to implement and provide an easy extension mechanism for complicated domain objects or custom XML formats. As it stands, the default converters are more than sufficient for most needs, but the added flexibility can come in handy in certain situations.
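For comparison, here is a minimal sketch of the JDK-side hook mentioned above, using a PersistenceDelegate with XMLEncoder rather than XStream's converter API. The immutable Item class is an assumption made for this example. DefaultPersistenceDelegate is given the property names that map to the constructor arguments, so the encoder records a constructor call instead of setter calls.

```java
import java.beans.DefaultPersistenceDelegate;
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class DelegateDemo {
    // Hypothetical immutable class: no no-arg constructor and no setters,
    // so the default XMLEncoder behavior cannot rebuild it.
    public static class Item {
        private final String name;
        private final int quantity;

        public Item(String name, int quantity) {
            this.name = name;
            this.quantity = quantity;
        }
        public String getName() { return name; }
        public int getQuantity() { return quantity; }
    }

    static String toXml(Item item) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(bytes)) {
            // Tell the encoder which properties feed the constructor, in order.
            encoder.setPersistenceDelegate(Item.class,
                    new DefaultPersistenceDelegate(new String[] {"name", "quantity"}));
            encoder.writeObject(item);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    static Item fromXml(String xml) {
        ByteArrayInputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        try (XMLDecoder decoder = new XMLDecoder(in, null, null,
                DelegateDemo.class.getClassLoader())) {
            return (Item) decoder.readObject();
        }
    }

    public static void main(String[] args) {
        String xml = toXml(new Item("pen", 3));
        System.out.println(xml);
        Item copy = fromXml(xml);
        System.out.println(copy.getName() + " x " + copy.getQuantity());
    }
}
```

In this simple case the XML is just a recorded constructor call, so XMLDecoder can replay it without any configuration of its own. A delegate that emits something the decoder cannot replay would have no such symmetry.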
Upon seeing this, it is easy to ask, "So what?" What are some uses for serialization? There are actually quite a few. At a prior job, we shipped not only code but data to our clients. For initial installs, this was not a problem, but for existing customers, we had to merge our updated data with any changes the client might have made on site. To accomplish this, we recorded every call into certain session beans. This data was then serialized (using JSX) into a log file. When an update shipped to a client, this XML was reconstituted into the recorded events and the events were played back into the system. A simple table copy would have clobbered the client's data, but by serializing these events, we were able to recreate our data team's efforts on site and successfully merge the two data sets. Another use, which I heard about the other day and thought was a pretty clever solution to a common problem, is deep cloning. By serializing an object tree into an intermediate form (in this case, XML) and then rebuilding that tree into new objects, a deep clone can be obtained with little effort. If the objects happen to be persistent objects, the IDs should not be cloned. By using custom converters, these attributes can be filtered out of the XML so that the clones can be assigned new IDs for true cloning. I am not recommending this as an especially fast way to achieve deep cloning, but it is an extremely cheap option from a development perspective.
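The deep-cloning trick can be sketched with nothing more than the JDK's XMLEncoder and XMLDecoder. The deepClone helper and the bean classes below are assumptions for illustration, and the filtering of persistent IDs described above is left out to keep the example short.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

class DeepCloneDemo {
    public static class Item {
        private String name;
        public Item() {}
        public Item(String name) { this.name = name; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static class Person {
        private String name;
        private List<Item> items = new ArrayList<>();
        public Person() {}
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public List<Item> getItems() { return items; }
        public void setItems(List<Item> items) { this.items = items; }
    }

    // Clones a bean graph by writing it to XML in memory and reading it back.
    @SuppressWarnings("unchecked")
    static <T> T deepClone(T bean) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(bytes)) {
            encoder.writeObject(bean);
        }
        ByteArrayInputStream in = new ByteArrayInputStream(bytes.toByteArray());
        try (XMLDecoder decoder = new XMLDecoder(in, null, null,
                bean.getClass().getClassLoader())) {
            return (T) decoder.readObject();
        }
    }

    public static void main(String[] args) {
        Person original = new Person();
        original.setName("Ada");
        original.getItems().add(new Item("book"));

        Person clone = deepClone(original);
        clone.getItems().get(0).setName("pen"); // mutate the clone only

        System.out.println("original item: " + original.getItems().get(0).getName());
        System.out.println("clone item:    " + clone.getItems().get(0).getName());
    }
}
```

Because the clone is rebuilt from the XML, nested objects are new instances too, so changes to the clone's items leave the original untouched. The same caveat about fields without accessors applies here as it does to any other XMLEncoder round trip.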
As we have seen, using serialization is almost trivial. There are some tricky aspects when the default techniques are not enough, but extending these systems is not terribly difficult either. In my experience, using XML serialization rather than the original binary format need not be overly slow either. So which approach would I recommend? If you are the type who prefers to use the JDK options where possible, then XMLEncoder is the obvious choice. Personally, I am not constrained by such preferences, so I would recommend XStream. It is actively developed, produces nice, clean XML, and is easily extensible.