Chasing Programming Wisdom

Ramblings by Tomaz Vieira

(Almost) Safe Serialization

Serialization and deserialization have different implications when it comes to type safety. Any data that comes into a program from the outside can't be trusted to be coherent, and any serialization logic can't be trusted to produce something that actually represents the original data. There are better and worse ways to go about dealing with serialization, and I'll explore a few here. Of course, I'm assuming that the program doing the (de)serialization is statically type-checked.

Serialization and deserialization are type-unsafe in different, asymmetric ways:

Fallibility

Serialization is usually infallible (modulo IO issues), since a struct has all data that it requires to produce its serialized representation. Deserialization, on the other hand, is fallible, since it might encounter bad or incomplete data coming from program input.
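This asymmetry can be sketched in plain Rust (std only, no serde): the hand-rolled format and the `Person` type below are made up for illustration, but notice that `serialize` can't fail while `deserialize` must return a `Result`.

```rust
// A minimal sketch of the fallibility asymmetry: serializing a Person is
// infallible because the struct already holds everything it needs, while
// parsing one back from untrusted input can fail at every step.
#[derive(Debug, PartialEq)]
struct Person {
    name: String,
    age: u8,
}

impl Person {
    // Infallible: no Result needed (modulo IO, which happens elsewhere).
    fn serialize(&self) -> String {
        format!("{},{}", self.name, self.age)
    }

    // Fallible: the input may be missing fields or hold garbage.
    fn deserialize(input: &str) -> Result<Person, String> {
        let (name, age) = input
            .split_once(',')
            .ok_or_else(|| "missing ',' separator".to_string())?;
        let age: u8 = age.parse().map_err(|e| format!("bad age: {e}"))?;
        Ok(Person { name: name.to_string(), age })
    }
}
```

Round-tripping a valid `Person` succeeds, while feeding `deserialize` a malformed string surfaces an `Err` instead of a crash.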

Safety

Deserialization is inherently type-safe if we assume that a struct can only be created by calling a constructor with valid arguments. This naturally leads to deserialization code that tries (and potentially safely fails) to produce all required constructor arguments out of the unsafe input data. Serialization is inherently unsafe, as there is no way a compiler can tell that the serialization code is dumping all relevant fields, and that it is doing so in a way that can then be read back properly.

(De)serialization strategies that rely on runtime type information (e.g. Java's or Python's reflection capabilities) are clearly unsafe: if the serialization logic relies on runtime type information then it can't be verified by the compiler, which can lead to code that fails at runtime when trying to serialize a value that is either not serializable (e.g. pointers, locks) or not meant to be serialized (e.g. caches, derived values, internal values). A better, safer strategy is code generation, like what is done by the “serde” crate in Rust: for every unannotated field in a struct, serde will generate code that produces its serialized version. The fact that it generates actual Rust code gives it an extra level of safety, since the compiler gets to see the serialization code, flagging attempts to serialize unserializable values at compile time.
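To make the code-generation idea concrete, here is a simplified, hand-written sketch of the kind of code a derive macro emits. The `ToJson` trait is a made-up stand-in for serde's real `Serialize` machinery, but the key property is the same: the compiler type-checks every field access.

```rust
// A sketch of generated serialization code, written by hand. serde's
// derive macro emits real Rust much like this, so the compiler sees it.
struct Person {
    name: String,
    age: u8,
}

// Hypothetical stand-in for serde's Serialize trait.
trait ToJson {
    fn to_json(&self) -> String;
}

impl ToJson for Person {
    fn to_json(&self) -> String {
        // One expression per field; adding a non-serializable field
        // (e.g. a raw pointer) to Person would fail to compile here.
        format!(r#"{{"name":{:?},"age":{}}}"#, self.name, self.age)
    }
}
```

A reflection-based serializer would only discover an unserializable field at runtime; here, the generated code simply stops compiling.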

This strategy is pretty solid, but there is still room for unsafety in the form of serialized data being insufficient for deserialization. It's still possible to generate serialization code that serializes a struct into something that can never be deserialized, e.g. by skipping a required field via #[serde(skip_serializing)], and once again the compiler can't help us there, since it doesn't really know that the output of the serialization method is supposed to be the input of the deserialization function.
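The failure mode can be shown without serde at all. In this std-only sketch, the hand-written serializer forgets `age` (analogous to annotating it with #[serde(skip_serializing)]); the program compiles without complaint, and the mismatch only surfaces when deserialization runs.

```rust
// A sketch of serialized data that is insufficient for deserialization:
// serialize() silently drops `age`, and the compiler cannot know that
// deserialize() will later need it.
#[derive(Debug, PartialEq)]
struct Person {
    name: String,
    age: u8,
}

fn serialize(p: &Person) -> String {
    // Oops: `age` is never written. This compiles just fine.
    p.name.clone()
}

fn deserialize(input: &str) -> Result<Person, String> {
    let (name, age) = input
        .split_once(',')
        .ok_or_else(|| "missing age field".to_string())?;
    let age: u8 = age.parse().map_err(|_| "bad age".to_string())?;
    Ok(Person { name: name.to_string(), age })
}
```

Round-tripping any `Person` through this pair always fails, yet nothing flags the bug at compile time.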

A mitigating strategy - using an intermediate “Message” type

Instead of serializing directly to the output format (e.g. raw bytes or a JSON string), one could first convert a data struct into an intermediate “message” type, specific to every struct that needs (de)serialization. So, for example, a Person struct would be converted to a PersonMessage struct, and only the PersonMessage struct would then be serializable to e.g. JSON. Then, by implementing methods like

PersonMessage::from(person: Person) -> PersonMessage

and

Person::from(message: &PersonMessage) -> Person

we ensure serialization and deserialization are tied together: if PersonMessage ever becomes insufficient for deserialization, the compiler will flag that insufficiency in the Person::from constructor method.
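A std-only sketch of the pattern (the `cached_greeting` field is an invented example of an internal value that never leaves the program): `PersonMessage` is the only shape that would ever be handed to a serializer, and the two conversions tie the formats together.

```rust
// A sketch of the intermediate "Message" type pattern. If a field that
// Person needs were ever dropped from PersonMessage, the second impl
// below would stop compiling.
struct Person {
    name: String,
    age: u8,
    cached_greeting: Option<String>, // internal value, never serialized
}

// Only this type would get e.g. #[derive(Serialize, Deserialize)].
struct PersonMessage {
    name: String,
    age: u8,
}

impl From<&Person> for PersonMessage {
    fn from(p: &Person) -> PersonMessage {
        PersonMessage { name: p.name.clone(), age: p.age }
    }
}

impl From<&PersonMessage> for Person {
    fn from(m: &PersonMessage) -> Person {
        // Removing `age` from PersonMessage is a compile error here.
        Person { name: m.name.clone(), age: m.age, cached_greeting: None }
    }
}
```

Note that `cached_greeting` simply has no counterpart in `PersonMessage`, so there is no skip-annotation to forget.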

It is still necessary to manually ensure that ALL fields of the *Message types are being serialized, but I find that to be easier to enforce by policy, since those types only exist for serialization purposes and will never have fields that are not meant to be serialized (e.g. private internal values). Specifically for Rust's serde, struct-level attributes on the *Message types can help enforce this.

This is obviously very verbose, and it's likely that some macros could help keep the boilerplate down. The logic that converts between Person and PersonMessage has to be written manually, even if the logic that actually serializes PersonMessage can be auto-generated. This isn't necessarily bad, though, as often some validation has to be done on the message after deserialization. For example, maybe the logic that converts PersonMessage to Person would check that person_message.age < 130.
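That validation step fits naturally into the message-to-struct conversion if it uses `TryFrom` instead of `From`. A sketch, using the age check from above (the error type and message are invented for illustration):

```rust
// A sketch of post-deserialization validation: the conversion from the
// message type to the domain type is the natural place to reject values
// that are well-formed but incoherent.
struct PersonMessage {
    name: String,
    age: u8,
}

struct Person {
    name: String,
    age: u8,
}

impl TryFrom<&PersonMessage> for Person {
    type Error = String;

    fn try_from(m: &PersonMessage) -> Result<Person, String> {
        // The message deserialized fine, but the data may still be bogus.
        if m.age >= 130 {
            return Err(format!("implausible age: {}", m.age));
        }
        Ok(Person { name: m.name.clone(), age: m.age })
    }
}
```

This keeps the auto-generated (de)serialization dumb and mechanical, while all domain-specific judgment lives in one hand-written, compiler-checked conversion.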

So if you're obsessing about serialization correctness, I can recommend “Message” types as a pretty decent solution =)