Object Input/Output Streams and Serialization in Java

Using a fixed-length record format is a good choice if you need to store data of the same type. However, objects that you create in an object-oriented program are rarely all of the same type. For example, you might have an array called staff that is nominally an array of Employee records but contains objects that are actually instances of a subclass such as Manager.

It is certainly possible to come up with a data format that allows you to store such polymorphic collections—but, fortunately, we don’t have to. The Java language supports a very general mechanism, called object serialization, that makes it possible to write any object to an output stream and read it again later. (You will see in this chapter where the term “serialization” comes from.)

1. Saving and Loading Serializable Objects

To save object data, you first need to open an ObjectOutputStream object:

var out = new ObjectOutputStream(new FileOutputStream("employee.dat"));

Now, to save an object, simply use the writeObject method of the ObjectOutputStream class as in the following fragment:

var harry = new Employee("Harry Hacker", 50000, 1989, 10, 1);

var boss = new Manager("Carl Cracker", 80000, 1987, 12, 15);

out.writeObject(harry);

out.writeObject(boss);

To read the objects back in, first get an ObjectInputStream object:

var in = new ObjectInputStream(new FileInputStream("employee.dat"));

Then, retrieve the objects in the same order in which they were written, using the readObject method:

var e1 = (Employee) in.readObject();

var e2 = (Employee) in.readObject();

There is, however, one change you need to make to any class that you want to save to an output stream and restore from an object input stream. The class must implement the Serializable interface:

class Employee implements Serializable { . . . }

The Serializable interface has no methods, so you don’t need to change your classes in any way. In this regard, it is similar to the Cloneable interface that we discussed in Volume I, Chapter 6. However, to make a class cloneable, you still had to override the clone method of the Object class. To make a class serializable, you do not need to do anything else.

Behind the scenes, an ObjectOutputStream looks at all the fields of the objects and saves their contents. For example, when writing an Employee object, the name, date, and salary fields are written to the output stream.
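As a minimal sketch of the steps above, the following program writes two objects and reads them back in the same order. The simplified Employee and Manager classes (two fields, shorter constructors) and the use of an in-memory byte array instead of employee.dat are assumptions made only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SaveLoadDemo {
    // Hypothetical, simplified stand-ins for the chapter's Employee and Manager classes.
    static class Employee implements Serializable {
        final String name;
        final double salary;

        Employee(String name, double salary) {
            this.name = name;
            this.salary = salary;
        }
    }

    static class Manager extends Employee {
        Manager(String name, double salary) {
            super(name, salary);
        }
    }

    // Writes both objects to a byte array, then reads them back in the same order.
    static Employee[] roundTrip(Employee first, Employee second)
            throws IOException, ClassNotFoundException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(first);
            out.writeObject(second);
        }
        try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            var e1 = (Employee) in.readObject();
            var e2 = (Employee) in.readObject(); // actually a Manager at run time
            return new Employee[] { e1, e2 };
        }
    }

    public static void main(String[] args) throws Exception {
        Employee[] restored = roundTrip(new Employee("Harry Hacker", 50000),
                                        new Manager("Carl Cracker", 80000));
        for (Employee e : restored) {
            System.out.println(e.name + ": " + e.getClass().getSimpleName());
        }
    }
}
```

Note that the second object comes back as a Manager even though it is read through an Employee cast: the stream records the actual class of each object.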

However, there is one important situation to consider: What happens when one object is shared by several objects as part of their state?

To illustrate the problem, let us make a slight modification to the Manager class. Let’s assume that each manager has a secretary:

class Manager extends Employee
{
    private Employee secretary;
    …
}

Each Manager object now contains a reference to an Employee object that describes the secretary. Of course, two managers can share the same secretary, as is the case in Figure 2.5 and the following code:

var harry = new Employee("Harry Hacker", . . .);

var carl = new Manager("Carl Cracker", . . .);

carl.setSecretary(harry);

var tony = new Manager("Tony Tester", . . .);

tony.setSecretary(harry);

Saving such a network of objects is a challenge. Of course, we cannot save and restore the memory addresses for the secretary objects. When an object is reloaded, it will likely occupy a completely different memory address than it originally did.

Instead, each object is saved with a serial number, hence the name object serialization for this mechanism. Here is the algorithm:

Associate a serial number with each object reference that you encounter (as shown in Figure 2.6).
When encountering an object reference for the first time, save the object data to the output stream.
If it has been saved previously, just write “same as the previously saved object with serial number x.”

When reading the objects back, the procedure is reversed.

When an object is specified in an object input stream for the first time, construct it, initialize it with the stream data, and remember the association between the serial number and the object reference.
When the tag “same as the previously saved object with serial number x” is encountered, retrieve the object reference for that serial number.
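The effect of these steps can be observed with a short sketch in which two managers share one secretary. This is not the chapter's Listing 2.3; the simplified classes, the direct field access in place of setSecretary, and the in-memory byte array are assumptions made for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SharedReferenceDemo {
    // Hypothetical, simplified versions of the chapter's classes.
    static class Employee implements Serializable {
        String name;
        double salary;

        Employee(String name, double salary) {
            this.name = name;
            this.salary = salary;
        }

        void raiseSalary(double byPercent) {
            salary += salary * byPercent / 100;
        }
    }

    static class Manager extends Employee {
        Employee secretary;

        Manager(String name, double salary) {
            super(name, salary);
        }
    }

    // Serializes the whole array in one writeObject call and reads it back.
    static Employee[] saveAndReload(Employee[] staff)
            throws IOException, ClassNotFoundException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(staff);
        }
        try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Employee[]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        var harry = new Employee("Harry Hacker", 50000);
        var carl = new Manager("Carl Cracker", 80000);
        carl.secretary = harry;
        var tony = new Manager("Tony Tester", 40000);
        tony.secretary = harry;

        Employee[] newStaff = saveAndReload(new Employee[] { carl, harry, tony });
        newStaff[1].raiseSalary(10);

        var newCarl = (Manager) newStaff[0];
        var newTony = (Manager) newStaff[2];
        System.out.println(newCarl.secretary == newStaff[1]); // true: one shared object
        System.out.println(newTony.secretary == newStaff[1]); // true
        System.out.println(newStaff[1] == harry);             // false: a reloaded copy
    }
}
```

After reloading, both managers refer to the same secretary object, so the raise is visible through either reference, while the reloaded secretary is distinct from the original object.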

Listing 2.3 is a program that saves and reloads a network of Employee and Manager objects (some of which share the same employee as a secretary). Note that the secretary object is unique after reloading—when newStaff[1] gets a raise, that is reflected in the secretary fields of the managers.

Listing 2.3 objectStream/ObjectStreamTest.java

2. Understanding the Object Serialization File Format

Object serialization saves object data in a particular file format. Of course, you can use the writeObject/readObject methods without having to know the exact sequence of bytes that represents objects in a file. Nonetheless, we found studying the data format extremely helpful for gaining insight into the object serialization process. As the details are somewhat technical, feel free to skip this section if you are not interested in the implementation.

Every file begins with the two-byte “magic number”

AC ED

followed by the version number of the object serialization format, which is currently

00 05

(We use hexadecimal numbers throughout this section to denote bytes.) Then, it contains a sequence of objects, in the order in which they were saved.

String objects are saved as

74 two-byte length characters

For example, the string “Harry” is saved as

74 00 05 Harry

The Unicode characters of the string are saved in the “modified UTF-8” format.
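To check the header and the string encoding without a hex editor, one can serialize a string into a byte array and print the bytes. The following is a small sketch; the class name StreamHeaderDemo and its helper methods are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

class StreamHeaderDemo {
    // Serializes a single object into a byte array.
    static byte[] serialize(Object obj) throws IOException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    // Formats bytes as two-digit hexadecimal numbers separated by spaces.
    static String toHex(byte[] data) {
        var sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Magic number AC ED, version 00 05, then 74, length 00 05, and "Harry"
        System.out.println(toHex(serialize("Harry")));
        // prints AC ED 00 05 74 00 05 48 61 72 72 79
    }
}
```

The last five bytes, 48 61 72 72 79, are the characters of “Harry”, which take one byte each in modified UTF-8.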

When an object is saved, the class of that object must be saved as well. The class description contains

The name of the class
The serial version unique ID, which is a fingerprint of the data field types and method signatures
A set of flags describing the serialization method
A description of the data fields

The fingerprint is obtained by ordering the descriptions of the class, superclass, interfaces, field types, and method signatures in a canonical way, and then applying the so-called Secure Hash Algorithm (SHA) to that data.

SHA is a fast algorithm that gives a “fingerprint” of a larger block of information. This fingerprint is always a 20-byte data packet, regardless of the size of the original data. It is created by a clever sequence of bit operations on the data that makes it essentially 100 percent certain that the fingerprint will change if the information is altered in any way. (For more details on SHA, see, for example, Cryptography and Network Security, Seventh Edition by William Stallings, Prentice Hall, 2016.) However, the serialization mechanism uses only the first eight bytes of the SHA code as a class fingerprint. It is still very likely that the class fingerprint will change if the data fields or methods change.

When reading an object, its fingerprint is compared against the current fingerprint of the class. If they don’t match, it means the class definition has changed after the object was written, and an exception is generated. Of course, in practice, classes do evolve, and it might be necessary for a program to read in older versions of objects. We will discuss this in Section 2.3.5, “Versioning,” on p. 103.

Here is how a class identifier is stored:

72
2-byte length of class name
Class name
8-byte fingerprint
1-byte flag
2-byte count of data field descriptors
Data field descriptors
78 (end marker)
Superclass type (70 if none)

The flag byte is composed of three bit masks, defined in java.io.ObjectStreamConstants:

static final byte SC_WRITE_METHOD = 1;
    // class has a writeObject method that writes additional data
static final byte SC_SERIALIZABLE = 2;
    // class implements the Serializable interface
static final byte SC_EXTERNALIZABLE = 4;
    // class implements the Externalizable interface

We discuss the Externalizable interface later in this chapter. Externalizable classes supply custom read and write methods that take over the output of their instance fields. The classes that we write implement the Serializable interface and will have a flag value of 02. The serializable java.util.Date class defines its own readObject/writeObject methods and has a flag of 03.

Each data field descriptor has the format:

1-byte type code
2-byte length of field name
Field name
Class name (if the field is an object)

The type code is one of the following:

B byte

C char

D double

F float

I int

J long

L object

S short

Z boolean

[ array

When the type code is L, the field name is followed by the field type. Class and field name strings do not start with the string code 74, but field types do. Field types use a slightly different encoding of their names—namely, the format used by native methods.

For example, the salary field of the Employee class is encoded as

D 00 06 salary
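The type codes and field type strings can also be inspected programmatically through the java.io.ObjectStreamClass and ObjectStreamField classes. The sketch below assumes a simplified Employee class with three fields; the describe helper is made up for this illustration and does not reproduce the byte layout, only the information stored in each descriptor.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.time.LocalDate;

class FieldDescriptorDemo {
    // Hypothetical Employee class; only the field declarations matter here.
    static class Employee implements Serializable {
        private String name;
        private double salary;
        private LocalDate hireDay;
    }

    // Returns "typeCode fieldName [fieldType]" for one serializable field.
    static String describe(Class<?> cl, String fieldName) {
        ObjectStreamField f = ObjectStreamClass.lookup(cl).getField(fieldName);
        // getTypeString() is null for primitive fields and a JVM signature otherwise.
        String type = f.isPrimitive() ? "" : " " + f.getTypeString();
        return f.getTypeCode() + " " + f.getName() + type;
    }

    public static void main(String[] args) {
        System.out.println(describe(Employee.class, "salary"));  // D salary
        System.out.println(describe(Employee.class, "name"));    // L name Ljava/lang/String;
        System.out.println(describe(Employee.class, "hireDay")); // L hireDay Ljava/time/LocalDate;
    }
}
```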

Here is the complete class descriptor of the Employee class:

These descriptors are fairly long. If the same class descriptor is needed again in the file, an abbreviated form is used:

71 4-byte serial number

The serial number refers to the previous explicit class descriptor. We discuss the numbering scheme later.

An object is stored as

73 class descriptor object data

For example, here is how an Employee object is stored:

As you can see, the data file contains enough information to restore the Employee object.

Arrays are saved in the following format:

75 class descriptor 4-byte number of entries entries

The array class name in the class descriptor is in the same format as that used by native methods (which is slightly different from the format used by class names in other class descriptors). In this format, class names start with an L and end with a semicolon.

For example, an array of three Employee objects starts out like this:

Note that the fingerprint for an array of Employee objects is different from a fingerprint of the Employee class itself.

All objects (including arrays and strings) and all class descriptors are given serial numbers as they are saved in the output file. The numbers start at 00 7E 00 00.

We already saw that a full class descriptor for any given class occurs only once. Subsequent descriptors refer to it. For example, in our previous example, a repeated reference to the Date class was coded as

71 00 7E 00 08

The same mechanism is used for objects. If a reference to a previously saved object is written, it is saved in exactly the same way, that is, as 71 followed by the serial number. It is always clear from the context whether a particular serial reference denotes a class descriptor or an object.
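A back reference is easy to provoke: write the same object twice to one stream. In the following sketch (the class and method names are made up for this illustration), a string is the first object in the stream, so it receives the first serial number, 00 7E 00 00, and the second write produces only the five-byte reference.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

class BackReferenceDemo {
    // Writes the same object twice to one stream and returns the raw bytes.
    static byte[] writeTwice(Object obj) throws IOException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj); // full object data
            out.writeObject(obj); // only a reference to the earlier data
        }
        return bytes.toByteArray();
    }

    // Formats the last count bytes as hexadecimal numbers.
    static String tailHex(byte[] data, int count) {
        var sb = new StringBuilder();
        for (int i = data.length - count; i < data.length; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = writeTwice("Harry");
        System.out.println(data.length);      // prints 17: 12 bytes as before plus 5
        System.out.println(tailHex(data, 5)); // prints 71 00 7E 00 00
    }
}
```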

Finally, a null reference is stored as

70

Here is the commented output of the ObjectRefTest program of the preceding section. Run the program, look at a hex dump of its data file employee.dat, and compare it with the commented listing. The important lines toward the end of the output show a reference to a previously saved object.

Of course, studying these codes can be about as exciting as reading a phone book. It is not important to know the exact file format (unless you are trying to create an evil effect by modifying the data), but it is still instructive to know that the serialized format has a detailed description of all the objects it contains, with sufficient detail to allow reconstruction of both objects and arrays of objects.

What you should remember is this:

The serialized format contains the types and data fields of all objects.
Each object is assigned a serial number.
Repeated occurrences of the same object are stored as references to that serial number.

3. Modifying the Default Serialization Mechanism

Certain data fields should never be serialized—for example, integer values that store file handles or handles of windows that are only meaningful to native methods. Such information is guaranteed to be useless when you reload an object at a later time or transport it to a different machine. In fact, improper values for such fields can actually cause native methods to crash. Java has an easy mechanism to prevent such fields from ever being serialized: Mark them with the keyword transient. You also need to tag fields as transient if they belong to nonserializable classes. Transient fields are always skipped when objects are serialized.

The serialization mechanism provides a way for individual classes to add validation or any other desired action to the default read and write behavior. A serializable class can define methods with the signature

private void readObject(ObjectInputStream in)
    throws IOException, ClassNotFoundException;
private void writeObject(ObjectOutputStream out)
    throws IOException;

Then, the data fields are no longer automatically serialized—these methods are called instead.

Here is a typical example. Some classes are not serializable; for instance, Point2D.Double in the java.awt.geom package did not implement Serializable in older Java versions (before Java SE 6). Now, suppose you want to serialize a class LabeledPoint that stores a String and a Point2D.Double. If the class of the point field is not serializable, you first need to mark the Point2D.Double field as transient to avoid a NotSerializableException.

public class LabeledPoint implements Serializable
{
    private String label;
    private transient Point2D.Double point;
    …
}

In the writeObject method, we first write the object descriptor and the String field, label, by calling the defaultWriteObject method. This is a special method of the ObjectOutputStream class that can only be called from within a writeObject method of a serializable class. Then we write the point coordinates, using the standard DataOutput calls.

private void writeObject(ObjectOutputStream out) throws IOException
{
    out.defaultWriteObject();
    out.writeDouble(point.getX());
    out.writeDouble(point.getY());
}

In the readObject method, we reverse the process:

private void readObject(ObjectInputStream in)
    throws IOException, ClassNotFoundException
{
    in.defaultReadObject(); // can throw ClassNotFoundException
    double x = in.readDouble();
    double y = in.readDouble();
    point = new Point2D.Double(x, y);
}
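Putting the pieces together, here is a self-contained sketch of the technique. To make sure that the field type really is not serializable, it uses a small hypothetical Point class instead of Point2D.Double; the constructor, the toString method, and the roundTrip helper are likewise assumptions added for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientFieldDemo {
    // Hypothetical class that does NOT implement Serializable.
    static class Point {
        final double x;
        final double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    static class LabeledPoint implements Serializable {
        private String label;
        private transient Point point; // skipped by the default mechanism

        LabeledPoint(String label, double x, double y) {
            this.label = label;
            this.point = new Point(x, y);
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject(); // writes the label field
            out.writeDouble(point.x); // then the coordinates by hand
            out.writeDouble(point.y);
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject(); // restores the label field
            double x = in.readDouble();
            double y = in.readDouble();
            point = new Point(x, y); // rebuilds the transient field
        }

        @Override
        public String toString() {
            return label + " (" + point.x + ", " + point.y + ")";
        }
    }

    static LabeledPoint roundTrip(LabeledPoint p) throws IOException, ClassNotFoundException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (LabeledPoint) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(new LabeledPoint("corner", 1.5, -2)));
        // prints corner (1.5, -2.0)
    }
}
```

Without the two custom methods, the reloaded object would have a null point field, because transient fields are not written and are set to their default value on reading.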

Another example is the java.util.Date class that supplies its own readObject and writeObject methods. These methods write the date as a number of milliseconds from the epoch (January 1, 1970, midnight UTC). The Date class has a complex internal representation that stores both a Calendar object and a millisecond count to optimize lookups. The state of the Calendar is redundant and does not have to be saved.

The readObject and writeObject methods only need to save and load their data fields. They should not concern themselves with superclass data or any other class information.

Instead of letting the serialization mechanism save and restore object data, a class can define its own mechanism. To do this, a class must implement the Externalizable interface. This, in turn, requires it to define two methods:

public void readExternal(ObjectInput in)
    throws IOException, ClassNotFoundException;
public void writeExternal(ObjectOutput out)
    throws IOException;

Unlike the readObject and writeObject methods that were described in the previous section, these methods are fully responsible for saving and restoring the entire object, including the superclass data. When writing an object, the serialization mechanism merely records the class of the object in the output stream. When reading an externalizable object, the object input stream creates an object with the no-argument constructor and then calls the readExternal method. Here is how you can implement these methods for the Employee class:

public void readExternal(ObjectInput s) throws IOException
{
    name = s.readUTF();
    salary = s.readDouble();
    hireDay = LocalDate.ofEpochDay(s.readLong());
}

public void writeExternal(ObjectOutput s) throws IOException
{
    s.writeUTF(name);
    s.writeDouble(salary);
    s.writeLong(hireDay.toEpochDay());
}
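For completeness, here is a sketch of an externalizable class in a runnable program. The nesting inside a demo class, the three-argument constructor, the toString method, and the roundTrip helper are assumptions added for the demonstration. Note the public no-argument constructor, which the object input stream needs in order to create the object before calling readExternal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.time.LocalDate;

class ExternalizableDemo {
    public static class Employee implements Externalizable {
        private String name;
        private double salary;
        private LocalDate hireDay;

        // Required by the Externalizable protocol: called when reading.
        public Employee() {
        }

        public Employee(String name, double salary, LocalDate hireDay) {
            this.name = name;
            this.salary = salary;
            this.hireDay = hireDay;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(name);
            out.writeDouble(salary);
            out.writeLong(hireDay.toEpochDay());
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            // Fields must be read in exactly the order in which they were written.
            name = in.readUTF();
            salary = in.readDouble();
            hireDay = LocalDate.ofEpochDay(in.readLong());
        }

        @Override
        public String toString() {
            return name + ", " + salary + ", " + hireDay;
        }
    }

    static Employee roundTrip(Employee e) throws IOException, ClassNotFoundException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(e);
        }
        try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Employee) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        var harry = new Employee("Harry Hacker", 50000, LocalDate.of(1989, 10, 1));
        System.out.println(roundTrip(harry)); // prints Harry Hacker, 50000.0, 1989-10-01
    }
}
```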

4. Serializing Singletons and Typesafe Enumerations

You have to pay particular attention to serializing and deserializing objects that are assumed to be unique. This commonly happens when you are implementing singletons and typesafe enumerations.

If you use the enum construct of the Java language, you need not worry about serialization—it just works. However, suppose you maintain legacy code that contains an enumerated type such as

public class Orientation
{
    public static final Orientation HORIZONTAL = new Orientation(1);
    public static final Orientation VERTICAL = new Orientation(2);

    private int value;

    private Orientation(int v) { value = v; }
}

This idiom was common before enumerations were added to the Java language. Note that the constructor is private. Thus, no objects can be created beyond Orientation.HORIZONTAL and Orientation.VERTICAL. In particular, you can use the == operator to test for object equality:

if (orientation == Orientation.HORIZONTAL) . . .

There is an important twist that you need to remember when a typesafe enumeration implements the Serializable interface. The default serialization mechanism is not appropriate. Suppose we write a value of type Orientation and read it in again:

Orientation original = Orientation.HORIZONTAL;

ObjectOutputStream out = . . .;

out.writeObject(original);

out.close();

ObjectInputStream in = . . .;

var saved = (Orientation) in.readObject();

Now the test

if (saved == Orientation.HORIZONTAL) . . .

will fail. In fact, the saved value is a completely new object of the Orientation type that is not equal to any of the predefined constants. Even though the constructor is private, the serialization mechanism can create new objects!

To solve this problem, you need to define another special serialization method, called readResolve. If the readResolve method is defined, it is called after the object is deserialized. It must return an object which then becomes the return value of the readObject method. In our case, the readResolve method will inspect the value field and return the appropriate enumerated constant:

protected Object readResolve() throws ObjectStreamException
{
    if (value == 1) return Orientation.HORIZONTAL;
    if (value == 2) return Orientation.VERTICAL;
    // ObjectStreamException is abstract, so throw a concrete subclass
    throw new InvalidObjectException("Unknown value: " + value); // this shouldn’t happen
}
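The difference that readResolve makes can be seen by comparing two variants of such a class side by side. In the following sketch, NaiveOrientation is a hypothetical copy of the class without readResolve, and roundTrip is a helper added for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

class ReadResolveDemo {
    // Typesafe enumeration with readResolve: constants stay unique.
    static class Orientation implements Serializable {
        static final Orientation HORIZONTAL = new Orientation(1);
        static final Orientation VERTICAL = new Orientation(2);

        private final int value;

        private Orientation(int v) { value = v; }

        protected Object readResolve() throws ObjectStreamException {
            if (value == 1) return HORIZONTAL;
            if (value == 2) return VERTICAL;
            throw new InvalidObjectException("Unknown value: " + value);
        }
    }

    // Hypothetical variant without readResolve, for comparison.
    static class NaiveOrientation implements Serializable {
        static final NaiveOrientation HORIZONTAL = new NaiveOrientation(1);

        private final int value;

        private NaiveOrientation(int v) { value = v; }
    }

    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // true: readResolve replaced the new object with the existing constant
        System.out.println(roundTrip(Orientation.HORIZONTAL) == Orientation.HORIZONTAL);
        // false: deserialization created a second object despite the private constructor
        System.out.println(roundTrip(NaiveOrientation.HORIZONTAL) == NaiveOrientation.HORIZONTAL);
    }
}
```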

Remember to add a readResolve method to all typesafe enumerations in your legacy code and to all classes that follow the singleton design pattern.

5. Versioning

If you use serialization to save objects, you need to consider what happens when your program evolves. Can version 1.1 read the old files? Can the users who still use 1.0 read the files that the new version is producing? Clearly, it would be desirable if object files could cope with the evolution of classes.

At first glance, it seems that this would not be possible. When a class definition changes in any way, its SHA fingerprint also changes, and you know that object input streams will refuse to read in objects with different fingerprints. However, a class can indicate that it is compatible with an earlier version of itself. To do this, you must first obtain the fingerprint of the earlier version of the class. Use the standalone serialver program that is part of the JDK to obtain this number. For example, running

serialver Employee

prints

Employee: static final long serialVersionUID = -1814239825517340645L;

All later versions of the class must define the serialVersionUID constant to the same fingerprint as the original.

class Employee implements Serializable // version 1.1
{
    …
    public static final long serialVersionUID = -1814239825517340645L;
}

When a class has a static data member named serialVersionUID, the serialization mechanism does not compute the fingerprint itself but uses that value instead.

Once that static data member has been placed inside a class, the serialization system is now willing to read in different versions of objects of that class.
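The fingerprint that the serialization mechanism will use for a class can also be queried at run time through java.io.ObjectStreamClass. The following sketch assumes two small hypothetical classes, one with an explicit serialVersionUID and one without; the fingerprint helper is made up for this illustration.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SerialVersionDemo {
    // Declares the fingerprint explicitly, as in the text above.
    static class Employee implements Serializable {
        public static final long serialVersionUID = -1814239825517340645L;
        private String name;
        private double salary;
    }

    // Hypothetical class without the constant: the fingerprint is computed.
    static class Unversioned implements Serializable {
        private String name;
    }

    // Returns the serial version UID that object streams use for the class.
    static long fingerprint(Class<?> cl) {
        return ObjectStreamClass.lookup(cl).getSerialVersionUID();
    }

    public static void main(String[] args) {
        // The declared constant is used as is.
        System.out.println(fingerprint(Employee.class)); // prints -1814239825517340645
        // A computed value that changes when the fields or methods change.
        System.out.println(fingerprint(Unversioned.class));
    }
}
```

Because the computed value depends on details of the class, declaring the constant explicitly is the only way to keep the fingerprint stable while the class evolves.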

If only the methods of the class change, there is no problem with reading the new object data. However, if the data fields change, you may have problems. For example, the old file object may have more or fewer data fields than the one in the program, or the types of the data fields may be different. In that case, the object input stream makes an effort to convert the serialized object to the current version of the class.

The object input stream compares the data fields of the current version of the class with those of the version in the serialized object. Of course, the object input stream considers only the nontransient and nonstatic data fields. If two fields have matching names but different types, the object input stream makes no effort to convert one type to the other; the objects are incompatible. If the serialized object has data fields that are not present in the current version, the object input stream ignores the additional data. If the current version has data fields that are not present in the serialized object, the added fields are set to their default (null for objects, zero for numbers, and false for boolean values).

Here is an example. Suppose we have saved a number of employee records on disk, using the original version (1.0) of the class. Now we change the Employee class to version 2.0 by adding a data field called department. Figure 2.7 shows what happens when a 1.0 object is read into a program that uses 2.0 objects. The department field is set to null. Figure 2.8 shows the opposite scenario: A program using 1.0 objects reads a 2.0 object. The additional department field is ignored.

Is this process safe? It depends. Dropping a data field seems harmless: the recipient still has all the data that it knows how to manipulate. Setting a data field to null might not be so safe. Many classes work hard to initialize all data fields in all constructors to non-null values, so that the methods don’t have to be prepared to handle null data. It is up to the class designer to implement additional code in the readObject method to fix version incompatibilities or to make sure the methods are robust enough to handle null data.

6. Using Serialization for Cloning

There is an amusing use for the serialization mechanism: It gives you an easy way to clone an object, provided the class is serializable. Simply serialize it to an output stream and then read it back in. The result is a new object that is a deep copy of the existing object. You don’t have to write the object to a file—you can use a ByteArrayOutputStream to save the data into a byte array.

As Listing 2.4 shows, to get clone for free, simply extend the SerialCloneable class, and you are done.

You should be aware that this method, although clever, will usually be much slower than a clone method that explicitly constructs a new object and copies or clones the data fields.
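The core of the technique fits into a single helper method. The following is a minimal sketch and not the chapter's Listing 2.4: instead of a superclass with a clone method, it uses a hypothetical static deepCopy method that works for any serializable object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

class SerialCopyDemo {
    // Copies an object by serializing it to a byte array and reading it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T original) {
        try {
            var bytes = new ByteArrayOutputStream();
            try (var out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (var in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            // Should not happen for in-memory streams and classes that are already loaded.
            throw new IllegalStateException("Could not copy object", e);
        }
    }

    public static void main(String[] args) {
        var original = new ArrayList<int[]>();
        original.add(new int[] { 1, 2, 3 });

        var copy = deepCopy(original);
        copy.get(0)[0] = 99; // changes only the copy

        System.out.println(original.get(0)[0]); // prints 1
        System.out.println(copy.get(0)[0]);     // prints 99
    }
}
```

The copy is deep: not only the list but also the array inside it is duplicated, which is why the change to the copy does not affect the original.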

Listing 2.4 serialClone/SerialCloneTest.java