I spend a lot of time talking to developers of varying skill
levels. One question that comes up often concerns the serialization of
objects. When Java was first released, your options were pretty simple:
use the binary serialization provided by the JDK or roll your own mechanism.
While the choices were simple, neither solution was all that simple
in practice. After seeing this question come up on several occasions,
I decided to write an article covering the basics of XML
serialization in the hope that it would give some of these developers at
least a basic understanding of how it can be used.
The "standard" JDK serialization produces a binary file with various
encodings in the file including a version flag. This makes it difficult to
deserialize that object in another VM. If the VM versions differ, it may
refuse to deserialize the object at all, and the difference can be as
small as a minor version release. The serialized object is
also in a binary format. While this is useful from a performance
standpoint, it can make debugging more difficult. It also makes manual
migration of serialized data next to impossible to achieve without intimate
knowledge of the format.
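To make the binary format a little more concrete, here is a minimal sketch that serializes an object with the standard ObjectOutputStream and inspects the first four bytes of the result. The Person class here is a hypothetical stand-in, not the one used later in this article. The stream opens with a fixed magic number (0xACED) followed by a stream version, and everything after that is equally opaque without knowledge of the protocol.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class BinarySerializationDemo {

    // Hypothetical stand-in for a domain object (an assumption for this sketch).
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;

        Person(String name) {
            this.name = name;
        }
    }

    // Writes the object with the JDK's built-in binary serialization.
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds the object from the binary form.
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads the big-endian unsigned 16-bit value at the given offset.
    static int unsignedShort(byte[] data, int offset) {
        return ((data[offset] & 0xFF) << 8) | (data[offset + 1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] data = serialize(new Person("Ada"));
        System.out.printf("magic=0x%04X streamVersion=%d totalBytes=%d%n",
                unsignedShort(data, 0), unsignedShort(data, 2), data.length);
        Person copy = (Person) deserialize(data);
        System.out.println("round-tripped name: " + copy.name);
    }
}
```

Opening the resulting bytes in a text editor shows little beyond the class and field names, which is the debugging burden described above.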
This binary format makes it difficult to serialize data for use in an
external tool. If that tool is Java-based, then it's a manageable task.
But for non-Java tools, it would require some form of conversion to a
usable format. That solution has its own obvious problems: an extra
step in the chain, potential conversion errors, maintenance issues, etc.
Despite these problems, the JDK evolved for years without really addressing
the issue until 1.4 introduced XML serialization.
With XML serialization, the JDK finally provided a portable mechanism for
serializing data that could survive JVM upgrades and is readily usable by
external tools. There are several competing serialization tools, however,
that are, for many, superior choices. This article will compare and
contrast two such libraries, JSX and XStream, with the standard JDK option.
I know that JSX has been discontinued in favor of JSX2, but JSX2 is only
available commercially and I have fairly extensive experience with the
original JSX. I realize it seems a bit unfair to compare an older version,
but since it is freely available it is more likely to be used in open source
projects. (Update: the author has taken down the download link for JSX, so
neither is available at this point.) JSX2 is largely source compatible (only
the XML output is much, much cleaner), so this should still be fairly
relevant should you decide to purchase JSX2. (The author of JSX2 has
informed me that he is phasing out JSX2 as a product, so the point is
especially moot.) So with that out of the way, let's get to it.
For the purposes of this article, I have written two fairly simple objects
that will hopefully sufficiently demonstrate the basic principles. I have
a Person object that has a list of Items. To test all this, I've written a
simple program to run through each approach.
As you can see, there is not much difference in terms of effort to use
any of these options. XMLEncoder takes one extra step to
close the encoder but that's the only major difference you will see in your
code. Deserialization is also really simple with all three approaches. The
differences you'll find are all in the XML and those differences can be
quite dramatic.
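For readers without the test program at hand, here is a minimal sketch of what an XMLEncoder round trip might look like. The Person and Item beans below are illustrative stand-ins written for this sketch, not the classes from this article, and the helper names toXml and fromXml are likewise assumptions. Note the explicit close() on the encoder, which is the extra step mentioned above: it is what flushes the output and writes the closing tag.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

class XmlEncoderRoundTrip {

    // Stand-in bean: public class, public no-arg constructor, getters and setters.
    public static class Item {
        private String name;

        public Item() {}
        public Item(String name) { this.name = name; }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    // Stand-in bean holding a list of Items.
    public static class Person {
        private String name;
        private List<Item> items = new ArrayList<>();

        public Person() {}

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public List<Item> getItems() { return items; }
        public void setItems(List<Item> items) { this.items = items; }
    }

    // Serializes a bean to XML held in memory.
    static byte[] toXml(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(out);
        encoder.writeObject(bean);
        encoder.close(); // flushes and writes the closing </java> tag
        return out.toByteArray();
    }

    // Rebuilds the bean by replaying the XML, resolving classes with this class's loader.
    static Object fromXml(byte[] xml) {
        XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(xml), null, null,
                XmlEncoderRoundTrip.class.getClassLoader());
        try {
            return decoder.readObject();
        } finally {
            decoder.close();
        }
    }

    public static void main(String[] args) {
        Person person = new Person();
        person.setName("Ada");
        person.getItems().add(new Item("book"));
        person.getItems().add(new Item("pen"));

        byte[] xml = toXml(person);
        System.out.println(new String(xml, java.nio.charset.StandardCharsets.UTF_8));

        Person copy = (Person) fromXml(xml);
        System.out.println(copy.getName() + " has " + copy.getItems().size() + " items");
    }
}
```

The other two libraries follow the same shape of one call to write and one call to read, so the interesting differences show up in the XML rather than in the calling code.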
Let's take a look at XMLEncoder first since it comes with the JDK and is
the most accessible. As you can see from the generated XML, the output is
quite verbose. If you look closely, you will see that the XML is a
step-by-step set of instructions telling the JVM which methods to call to
rebuild this object in memory. If you look even more closely, you will
notice that not all of our data is there. The created field defined in
Person is missing. This is because there are no getter and setter methods
for created. Without these methods, XMLEncoder cannot serialize and
deserialize this field.
Granted, it is easy enough to add these methods to make
XMLEncoder work, but having internal fields is very common
and it is not always wise or desirable to expose these fields. That aside,
it hardly seems reasonable to expect a developer to change potentially
hundreds of classes to accommodate a serialization scheme given the range
of options available.
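The missing-field behavior is easy to reproduce. The sketch below assumes a cut-down Person bean (not the article's actual class) with a created field that has no getCreated/setCreated pair. The createdAt and stamp methods are hypothetical helpers that exist only so the field can be inspected; because they do not follow the bean naming pattern, XMLEncoder does not treat created as a property.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Date;

class MissingFieldDemo {

    // Assumed stand-in bean: "name" is a bean property, "created" is not.
    public static class Person {
        private String name;
        private Date created = new Date();

        public Person() {}

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }

        // Not a getter/setter pair by bean conventions, so invisible to XMLEncoder.
        public Date createdAt() { return created; }
        public void stamp(Date when) { this.created = when; }
    }

    static String toXml(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(out);
        encoder.writeObject(bean);
        encoder.close();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    static Object fromXml(String xml) {
        XMLDecoder decoder = new XMLDecoder(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                null, null, MissingFieldDemo.class.getClassLoader());
        try {
            return decoder.readObject();
        } finally {
            decoder.close();
        }
    }

    public static void main(String[] args) {
        Person person = new Person();
        person.setName("Ada");
        person.stamp(new Date(0L)); // pin the timestamp to the epoch

        String xml = toXml(person);
        System.out.println(xml);
        System.out.println("XML mentions created? " + xml.contains("created"));

        // The copy gets a fresh timestamp from its constructor, not the original one.
        Person copy = (Person) fromXml(xml);
        System.out.println("timestamp preserved? " + copy.createdAt().equals(person.createdAt()));
    }
}
```

The name survives the round trip while the original timestamp is silently lost, which is arguably worse than a loud failure.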
I have used JSX on previous jobs to do some pretty intense logging.
Basically, our setup was to use JSX to serialize events so that they could
be played back at a later date. This means that the process had to be fast
and rock solid. To its credit, JSX could serialize anything we threw at it
and was very responsive.
The main complaint I had about JSX was the XML output. The XML is fairly verbose as you can see.
Typically, this is not a problem as most of the time tools will be processing
this data and you will not have much need to deal with it manually. In our
experience, however, there is definitely a debugging burden for complex
object structures. Another downside is that dates get serialized as
milliseconds since the epoch. I am sure this makes it faster to deserialize
but it makes visual inspection difficult. Still, overall, it is very easy
to use and not overly difficult to read.
XStream is a new library to me. I have seen it come up once or twice in
discussion but have never used it in practice. XStream is, in part, one of
the reasons for this article. Like the other two mechanisms, XStream is
very simple to use. Probably the biggest differentiator for me with XStream
is the XML it generates. As you can see, even the date is written in a
human-readable form. This makes auditing the XML so much easier than, say,
with JSX because you do not have to convert milliseconds since the epoch to
some understandable form.
It's a nit, but I like how there is one class to deal with for both
serialization and deserialization. The API is fairly straightforward,
without any hoops to jump through just to get started. It provides some
nice integration with standard serialization mechanisms. One can create
a standard ObjectOutputStream and drop it into existing code
without the system being aware that it is using XStream's
ObjectOutputStream. Obviously, this can be incredibly useful when
migrating existing serialization code without needing to touch too much
code. Couple this with an IoC framework like Spring and it might not even
need a recompile.
The heart of any XML serializer is the mechanism that converts the
data to and from XML. Of the three libraries mentioned here,
only XMLEncoder and XStream provide obvious mechanisms for overriding the
built-in converters. XMLEncoder has PersistenceDelegates that can be
assigned to alter the generated XML, but no obvious mechanism exists for the
trip back. XStream, on the other hand, has a very nice list of defined
converters to serialize a variety of data types. These are the default
converters that XStream will use. Registering a custom converter is very
simple, though. The following code snippet demonstrates how easy it is to do:
xstream.registerConverter(new CustomDomainObjectConverter());
Converters have a simple interface to implement and provide an easy
extension mechanism for complicated domain objects or custom XML formats.
As it stands, the default converters are more than sufficient for most
needs, but the added flexibility can come in handy in certain situations.
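On the JDK side, the PersistenceDelegate mechanism mentioned above can be sketched as follows. This example assumes a hypothetical immutable Money class with no no-arg constructor and no setters, which XMLEncoder cannot handle by default. A DefaultPersistenceDelegate built with constructor property names tells the encoder to emit a constructor call instead. No decoder-side hook is needed in this case because XMLDecoder simply replays whatever calls the XML describes.

```java
import java.beans.DefaultPersistenceDelegate;
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PersistenceDelegateDemo {

    // Hypothetical immutable value class: constructor arguments only, no setters.
    public static class Money {
        private final String currency;
        private final int amount;

        public Money(String currency, int amount) {
            this.currency = currency;
            this.amount = amount;
        }

        public String getCurrency() { return currency; }
        public int getAmount() { return amount; }
    }

    static String toXml(Money money) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(out);
        // Emit "new Money(currency, amount)", reading the arguments from the
        // getters that match these property names, in this order.
        encoder.setPersistenceDelegate(Money.class,
                new DefaultPersistenceDelegate(new String[] {"currency", "amount"}));
        encoder.writeObject(money);
        encoder.close();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    static Money fromXml(String xml) {
        XMLDecoder decoder = new XMLDecoder(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                null, null, PersistenceDelegateDemo.class.getClassLoader());
        try {
            return (Money) decoder.readObject();
        } finally {
            decoder.close();
        }
    }

    public static void main(String[] args) {
        String xml = toXml(new Money("USD", 5));
        System.out.println(xml);

        Money copy = fromXml(xml);
        System.out.println(copy.getAmount() + " " + copy.getCurrency());
    }
}
```

The delegate has to be registered on each encoder instance, which is more ceremony than the single registerConverter call XStream asks for.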
Upon seeing this, it is easy to ask, "So what?" What are some uses for
serialization? There are actually quite a few. At a prior job, we shipped
not only code but data to our clients. For initial installs, this was not
a problem but for existing customers, we had to merge our updated data with
any changes the client might have made on site.
To accomplish this, we recorded every call into certain session beans.
This data was then serialized (using JSX) into a log file. When an update
shipped to a client, this XML was reconstituted back into the recorded
events and the events were played back into the system. A simple table copy
would have clobbered the client's data, but by serializing these events, we
were able to recreate our data team's efforts on site and successfully merge
the two data sets.
Another use I heard about the other day, which I thought was a pretty clever
solution to a common problem, is deep cloning. By serializing an object tree
into an intermediate form (in this case, XML) and then rebuilding that tree
into new objects, a deep clone can be obtained with little effort. If the
objects happen to be persistent objects, the IDs should not be cloned. By
using custom converters, these attributes can be filtered out of the XML so
that the clones can be assigned new IDs for true cloning. I am not recommending
this as an especially fast way to achieve deep cloning, but it is an
extremely cheap option from a development perspective.
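The same idea can be sketched with nothing but the JDK. Note the substitution: instead of an XStream custom converter, this version uses XMLEncoder and marks the ID getter with the java.beans.Transient annotation so that the ID is left out of the XML. The Person and Item beans and the deepClone helper are assumptions made for this sketch.

```java
import java.beans.Transient;
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

class XmlDeepClone {

    public static class Item {
        private String name;

        public Item() {}
        public Item(String name) { this.name = name; }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    // Stand-in for a persistent object with a database-assigned ID.
    public static class Person {
        private Long id;
        private String name;
        private List<Item> items = new ArrayList<>();

        public Person() {}

        @Transient // excluded from the XML, so a clone starts out without an ID
        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public List<Item> getItems() { return items; }
        public void setItems(List<Item> items) { this.items = items; }
    }

    // Clones a bean graph by writing it to XML and reading it straight back.
    @SuppressWarnings("unchecked")
    static <T> T deepClone(T bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(out);
        encoder.writeObject(bean);
        encoder.close();

        XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(out.toByteArray()),
                null, null, bean.getClass().getClassLoader());
        try {
            return (T) decoder.readObject();
        } finally {
            decoder.close();
        }
    }

    public static void main(String[] args) {
        Person original = new Person();
        original.setId(42L);
        original.setName("Ada");
        original.getItems().add(new Item("book"));

        Person clone = deepClone(original);
        System.out.println("clone id: " + clone.getId());
        System.out.println("clone name: " + clone.getName());
        System.out.println("items are distinct objects: "
                + (clone.getItems().get(0) != original.getItems().get(0)));
    }
}
```

Every nested object is rebuilt from the XML, so the clone shares no mutable state with the original, and the unset ID leaves the persistence layer free to assign a new one.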
As we have seen, using serialization is almost trivial. There are some
tricky aspects when the default techniques are not enough but extending
these systems is not terribly difficult either. In my experience, using
XML serialization rather than the original binary format need not be overly
slow either. So which approach would I recommend? If you are of the type
that prefers to use the JDK options where possible, then XMLEncoder is the
obvious choice. Personally, I am not constrained by such preferences so I
would recommend XStream. It is actively developed, produces nice, clean
XML, and is easily extensible.