Serialization and deserialization have different implications when it comes to type safety. Any data that comes into a program from the outside can't be trusted to be coherent, and any serialization logic can't be trusted to produce something that actually represents the original data. There are better and worse ways to go about dealing with serialization, and I'll explore a few here. Of course, I'm assuming that the program doing the (de)serialization is doing static type checking.
Serialization and deserialization are type-unsafe in different, asymmetric ways:
Serialization is usually infallible (modulo IO issues), since a struct has all data that it requires to produce its serialized representation. Deserialization, on the other hand, is fallible, since it might encounter bad or incomplete data coming from program input.
Deserialization is inherently type-safe if we assume that a struct can only be created by calling a constructor with valid arguments. This naturally leads to deserialization code that tries (and potentially safely fails) to produce all required constructor arguments out of the unsafe input data. Serialization is inherently unsafe, as there is no way a compiler can tell that the serialization code is dumping all relevant fields, and that it is doing so in a way that can the be read back properly.
(de)serialization strategies that rely on runtime type information (e.g. Java's or Pyton's reflection capabilities) are clearly unsafe; If the serialization logic relies on runtime type information then it can't be verified by the compiler, which can lead to code that fails at runtime when trying to serialize a value that is either not serializable (e.g.: pointers, locks) or not meant to be serialized (e.g.: caches, derived values, internal values). A better, safer strategy is that of code generation, like what is done by the “serde” crate in Rust; for every unannotated field in a struct, serde will generate code that produces its serialized version. The fact that it generates actual Rust code gives it an extra level of safety since the compiler gets to see the serialization code, flagging attempts to serialize unserializable values at compile time.
This strategy is pretty solid, but there is still room for unsafety in the form of serialized data being insufficient for deserialization; it's still possible to generate serialization code that serializes a struct into something that can never be deserialized due to e.g. not serializing a required field by annotating it with #[serde(skip_serializing)], and once again the compiler can't help us there, since it doesn't really know that the output of the serialization method is supposed to be the input of the deserialization function.
Instead of serializing directly to the output format (e.g. raw bytes or a JSON string), one could first convert a data struct into an intermediate “message” type, specific to every struct that needs (de)serialization. So, for example, a Person
struct would be converted to a PersonMessage
struct, and only the PersonMessage
struct would then be serializable to e.g. JSON. Then, by implementing methods like
PersonMessage::from(person: Person) -> PersonMessage
and
Person::from(message: &PersonMessage) -> Person
we ensure serialization and deserialization are tied together; If ever PersonMessage
becomes insufficient for deserialization, the compiler will flag that insufficiency in the Person::from_message
constructor method.
It is still necessary to manually ensure that ALL fields of the *Message
types are being serialized, but I find that to be easier to enforce by policy since those types only exist for serialization purposes and will never have fields that are not meant to be serialized (e.g. private internal values). Specifically for Rust's serde, this can be conveniently achieved by annotating structs like so:
This is obviously very verbose, and it's likely that some macros could help keep the boilerplate down. The logic that converts between Person
and PersonMessage
has to be written manually, even if the logic that actually serializes PersonMessage
can be auto-generated. This isn't necessarily bad, though, as often some validation has to be done on the message after deserialization. For example, maybe the logic that converts PersonMessage
to Person
would check that person_message.age < 130
.
So if you're obsessing about serialization correctness, I can recommend “Message” types as pretty decent solution =)