Data Serialization Formats
There are a lot of data serialization formats out there. Broadly, we can split them into two categories: simple serialization formats and schema-based serialization formats.
Simple Serialization Formats
Simple serialization formats only encode data into bytes. They are self-contained, meaning they include all the information needed to parse the data.
Some examples of simple serialization formats are JSON, MessagePack, CBOR, and BSON.
Pros:
- Simple to implement.
- Well-understood and widely supported.
- Often human-readable.
Cons:
- No schema, so backwards/forwards compatibility is left to the implementation, resulting in custom code or add-on specifications like JSON Schema.
- No validation of data.
- Typically slower to read, parse, and write.
- Typically larger in size.
The reasons for their trade-offs are easy to understand. Since there is no schema, the deserializer must read the entire message, so you can’t skip ahead, and you can’t assume the data is in any particular format. The code paths must account for all types, which makes them slower.
Since there is no schema, the data must include explicit information about types, lengths, and other metadata, which makes the data larger.
To make this clearer, let’s look at an example using JSON.
{ "date": "2024-11-17T21:04:00.000Z", "name": "John Doe", "age": 30, "isActive": true }
When you receive this data, you don’t know if the `date` field is a string, a number, a boolean, or null. You need to read the value to know, then transform it into the native string type of the language you are using.
If you only care about the `name` field, you still need to read the entire message. There is no way to skip ahead. Just because this message has a string `name` field in the second position doesn’t mean the next message will too. It could be in another position, contain a different type, or be missing altogether.
Finally, the data is larger because every key is encoded as a literal string in each message. If you have a large number of keys (or messages), this overhead adds up.
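To get a feel for this overhead, here is a minimal Python sketch comparing the JSON message above with a hand-rolled fixed binary layout (standing in for what a schema-based format would do; the layout itself is purely illustrative):

```python
import json
import struct

record = {
    "date": "2024-11-17T21:04:00.000Z",
    "name": "John Doe",
    "age": 30,
    "isActive": True,
}

# Self-describing: every key and every value is spelled out as text.
as_json = json.dumps(record).encode("utf-8")

# Schema-like: if both sides agree on the layout (an 8-byte Unix timestamp,
# a length-prefixed name, a 1-byte age, a 1-byte flag), no keys are needed.
name = record["name"].encode("utf-8")
as_packed = struct.pack(
    f"<qB{len(name)}sBB",
    1731877440,  # the date above as a Unix timestamp
    len(name), name,
    30,          # age
    1,           # isActive
)

print(len(as_json), len(as_packed))  # 85 vs 19 bytes for these values
```

The exact numbers don’t matter; the point is that the self-describing version repeats every key and spells out every value as text, while the fixed layout only works because both sides already agree on what each byte means.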
However, there are cases where simple serialization formats can be faster than schema-based ones. For example, if you are sending JSON to a JavaScript client, JSON has little overhead since the client can parse it natively (running optimized compiled code), whereas a schema-based format like Protocol Buffers has to be deserialized by JavaScript code (or WebAssembly), which is slower.
Schema-Based Serialization Formats
Schema-based serialization formats require a schema to be defined before any data is serialized. This means the schema is either known ahead of time, transported alongside the data, or exchanged between systems somehow.
Some examples of schema-based serialization formats are Protocol Buffers, Cap’n Proto, FlatBuffers, Avro, and Thrift.
Pros:
- Type safety.
- Data validation.
- Often smaller in size.
- Often faster to read, parse, and write.
Cons:
- More complex to implement.
- Requires a schema.
- Requires schema management.
Let’s look at an example using Cap’n Proto.
```capnp
struct Person {
  name @0 :Text;
  age @1 :Int32;
  isActive @2 :Bool;
}
```
When you receive the data, it will be in a binary format. The data is not human-readable, so you can’t just look at it and understand what it contains.
However, the data is type safe. If you know the schema, you know exactly what is in the message. You also know the order of the fields, the encoding used, and the size of each field, so you can skip ahead to the fields you care about. (Note: Not all schema-based formats support this.)
The data is often smaller in size because field names and other schema metadata are not included in the serialized data.
Next, we’ll look at each of these formats in more detail. For each format, we’ll look at its native types, how data is encoded and decoded, and how it handles schema evolution, schema transport, and ergonomics.
Protocol Buffers
Protocol Buffers, also known as protobuf, is a binary serialization format developed by Google. We’ll look at the version 3 specification, which is the latest at the time of this writing.
Types
Protocol Buffers supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Enum types
- Union types via `oneof`
- Nested types
- List types via `repeated`
- Map types
- Optional types via `optional`
- Dynamic types via `Any`
Encoding/Decoding
Protobuf uses a base-128 varint encoding for its integer types (the fixed-width and floating-point types take 4 or 8 bytes), so a varint takes 1 to 10 bytes, with smaller numbers taking up fewer bytes.
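As a sketch of how base-128 varints work (not the official implementation), each byte carries 7 bits of the value plus a continuation bit in the most significant position:

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint for a non-negative int: 7 bits per byte, MSB set while more follow."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (shift * 7)
        if not byte & 0x80:  # continuation bit cleared: last byte
            break
    return result

assert encode_varint(1) == b"\x01"        # small values fit in one byte
assert encode_varint(300) == b"\xac\x02"  # larger values grow a byte at a time
assert decode_varint(b"\xac\x02") == 300

# A field's tag is itself a varint of (field_number << 3) | wire_type, so
# field 1 with wire type 0 (varint) is the single byte 0x08, and a field
# `int32 a = 1;` set to 150 encodes as the three bytes 08 96 01.
assert encode_varint((1 << 3) | 0) == b"\x08"
assert encode_varint(150) == b"\x96\x01"
```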
The message structure follows a Tag-Length-Value (TLV) style encoding: each field is encoded with a tag (a varint holding the field number and wire type), a length (only for length-delimited values such as strings, bytes, and nested messages), and the value, encoded according to its type. Tags take 1 byte for field numbers up to 15 and up to 5 bytes for the largest field numbers; lengths are themselves varints.
Fields that are not set are simply not encoded, which saves space but means you cannot seek directly to a particular field; the message has to be scanned tag by tag.
The hard size limit for a message is 2 GiB, but in practice Google’s implementations default to a 64 MiB limit.
Schema Evolution
Protobuf enables backwards compatibility by encoding the field number in the tag, so:
- New fields can be added, and the previous version of the code will ignore them.
- Field types can be changed, as long as the new type is wire compatible (e.g. `int32`, `uint32`, `int64`, `uint64`, and `bool` are interchangeable).
- Field names can be changed, as long as the field number stays the same.
- Fields can be switched between implicit and explicit `optional`.
- Fields can be removed, but their field numbers must not be re-used.
- Field numbers (and names) can be marked `reserved` so they are never accidentally re-used.
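For example, when a field is removed, its number (and optionally its name) can be marked reserved so that a later edit cannot accidentally reuse it. A minimal, hypothetical sketch:

```proto
message Exam {
  reserved 2;           // field number of a removed field, never to be re-used
  reserved "taken_at";  // the old name can be reserved as well
  int32 score = 1;
}
```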
Schema Transport
Schema files are typically `.proto` files, which are then compiled into language-specific source files. This means the schema is not transported with the data; it has to be shared and compiled ahead of time.
Ergonomics
Protobuf has no `required` keyword. Instead, every field is optional: either implicitly optional (no `optional` keyword) or explicitly `optional`.
```proto
message Exam {
  int32 score = 1;              // Implicit optional
  optional int32 taken_at = 2;  // Explicit optional
}
```
If no value is provided for an implicit optional field, it will be set to the default value for that type.
If the `score` field is not present, it will be implicitly set to `0`, which is different from an actual score of `0`. Both values are encoded the same, so you can’t tell the difference.
To avoid this, you need to mark the `score` field as `optional`.
```proto
message Exam {
  optional int32 score = 1;     // Explicit optional
  optional int32 taken_at = 2;  // Explicit optional
}
```
Only then can you tell the difference between an actual score of `0` and a missing `score` field.
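In the generated code this difference shows up as field presence. A minimal Python sketch, assuming the explicit-optional Exam message above has been compiled with protoc into a (hypothetical) exam_pb2 module:

```python
# Assumes `protoc --python_out=. exam.proto` generated exam_pb2.py from the
# Exam message above, with `optional int32 score = 1;`.
from exam_pb2 import Exam

exam = Exam()
print(exam.score)              # 0 -- the default value
print(exam.HasField("score"))  # False -- the field was never set

exam.score = 0
print(exam.score)              # 0 -- same value as the default...
print(exam.HasField("score"))  # True -- ...but presence is now tracked
```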
Enums have the same issues, but they are handled differently depending on the language, which is even more confusing.
Lastly, protobuf has many numeric types (32-bit and 64-bit, signed and unsigned, varint and fixed integers and floats), each with their own performance characteristics.
Cap’n Proto
Cap’n Proto is a binary serialization library that was influenced by lessons learned from Protocol Buffers.
Version 1.0 was released in July 2023.
Types
Cap’n Proto supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Enum types
- Union types via `union`
- Nested types
- List types via `List(T)`
- Interface types via `interface`
- Generic types, e.g. `Data(T, Y)`
- Map types via generics, e.g. `Map(K, V)`
- Null type via `Void`
- Dynamic types via `AnyPointer`
Encoding/Decoding
Cap’n Proto’s binary encoding aligns data to 8-byte words for efficient memory access.
Each struct is split into a data section (fixed-size scalar fields) and a pointer section (offsets to variable-size objects such as text, lists, and nested structs), so:
- Data can be read/copied without knowing the schema.
- All pointers need to be read, but you can skip data for fields you don’t care about.
Because of this, Cap’n Proto is faster to read and write than Protocol Buffers, but less efficient in terms of space. If you are I/O bound, this may not be a good fit, though you can enable packing to reduce the size at the cost of extra CPU.
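The exact API differs per language; as a rough sketch using the third-party pycapnp bindings (assuming the Person schema above is saved as person.capnp; the method names are pycapnp’s and may differ elsewhere):

```python
import capnp  # pycapnp

# Assumes the `struct Person` schema shown earlier is saved as person.capnp.
schema = capnp.load("person.capnp")

person = schema.Person.new_message()
person.name = "John Doe"
person.age = 30
person.isActive = True

plain = person.to_bytes()          # word-aligned encoding: fast, but larger
packed = person.to_bytes_packed()  # packed encoding: smaller, but more CPU
print(len(plain), len(packed))
```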
Schema Evolution
Since Cap’n Proto adopts the same numeric field tags as Protocol Buffers, it supports the same schema evolution capabilities.
Schema Transport
Similar to Protocol Buffers, schema files are typically `.capnp` files, which are then compiled into language-specific source files. This means the schema is not transported with the data; it has to be shared and compiled ahead of time.
Ergonomics
Cap’n Proto has no `required` keyword and no `optional` keyword. Instead, fields are always optional and fall back to a default value if not present. To determine whether a field was set, you either give it an invalid default value and check for it, or, if you don’t mind the extra overhead, change the field to a union that includes a `Void` member.
This is more explicit than Protocol Buffers, but it takes more steps, and choosing an invalid default value can be hard and error-prone.
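A minimal sketch of that union-with-Void pattern, using a hypothetical Exam struct where “not taken” is distinguishable from a score of 0:

```capnp
struct Exam {
  score :union {
    notTaken @0 :Void;   # explicit "no value" case
    value    @1 :Int32;  # an actual score, including 0
  }
}
```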
FlatBuffers
FlatBuffers is a binary serialization library created by Google.
Types
FlatBuffers supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Enum types (integer values only) via `enum`
- Union types via `union`
- Nested types via `nested_flatbuffer`
- Array types
- Optional, required, and default values
Encoding/Decoding
FlatBuffers is similar to Cap’n Proto in that it uses an aligned binary encoding. The difference is that FlatBuffers aligns each field to the size of its scalar type rather than to 8-byte words, so it is more space efficient.
In addition, FlatBuffers uses a table-based encoding: each table carries a vtable of per-field offsets to the values, and fields left at their default value are simply omitted from the buffer. This means:
- You can skip reading fields you don’t care about.
- It is more space efficient, since default values take up no space in the buffer.
However, writing data is slower since the offsets need to be calculated, and sizes need to be known ahead of time.
Schema Evolution
FlatBuffers supports the same schema evolution capabilities as Protocol Buffers and Cap’n Proto. However, instead of explicit field numbers, FlatBuffers relies on field order: existing fields must keep their position (new fields are added at the end), and fields can only be deprecated, not removed.
Schema Transport
Similar to Protocol Buffers and Cap’n Proto, `.fbs` schema files are compiled into language-specific source files. This means the schema is not transported with the data; it has to be shared and compiled ahead of time.
Ergonomics
FlatBuffers has `required` fields, which must be present in the data. If they are not present, an error will be thrown during deserialization.
This is different from Protocol Buffers and Cap’n Proto, where fields are always optional, and default values are used if they are not present.
FlatBuffers’ design is great, but its documentation and language support are lacking.
Avro
Avro is a row-based serialization format developed by Apache.
Types
Avro supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Null type via `null`
- Enum types via `{"type": "enum", "symbols": ["X", "Y"]}`
- Array types via `{"type": "array", "items": "T"}`
- Nested types via `{"type": "record", "fields": [{"name": "x", "type": "T"}]}`
- Union types via `[X, Y]`
- Fixed-size types via `{"type": "fixed", "size": 16, "name": "MD5"}`
- Map types via `{"type": "map", "values": "V"}`
- Date, time, timestamp, duration and UUID types
Encoding/Decoding
Avro supports both a binary and a JSON encoding. The binary encoding uses variable-length zigzag encoding for `int` and `long` types, similar to Protocol Buffers’ varint encoding, so like Protocol Buffers it is space efficient but slower to read and write.
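A sketch of the zigzag mapping for 64-bit values (not Avro’s implementation): small numbers of either sign map to small unsigned numbers, so they stay short after variable-length encoding:

```python
def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit int to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-1000, 1000))
```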
In addition, Avro defines a sort order for data, meaning that comparisons can be made directly on the encoded bytes without deserializing them, allowing for efficient sorting and indexing.
Schema Evolution
Avro supports schema evolution through schema resolution: the reader resolves differences between the schema the data was written with and the schema it wants to read with.
In addition, Avro requires the reader to have the exact schema that was used to write the data. This is different from Protocol Buffers and Cap’n Proto, where the reader can decode data with only its own schema. It means new schemas must be made available to readers before writers start using them, and old schemas must be kept around to read old data.
At a small scale, Git could be used to distribute schemas; at a larger scale, a schema registry is needed.
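As a rough sketch of schema resolution using the third-party fastavro library (the schemas are hypothetical: a writer on version 1 of a record and a reader on version 2, which added a field with a default):

```python
import io
import fastavro

# Writer's schema: version 1 of a hypothetical record.
writer_schema = {
    "type": "record", "name": "Person",
    "fields": [{"name": "name", "type": "string"}],
}

# Reader's schema: version 2 adds an `age` field with a default,
# so data written with version 1 can still be resolved against it.
reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int", "default": -1},
    ],
}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"name": "John Doe"})
buf.seek(0)

# The reader must know the writer's schema to decode, then resolves to its own.
record = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(record)  # {'name': 'John Doe', 'age': -1}
```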
Schema Transport
Avro schemas are JSON documents, which are human-readable and can be transported with the data (Avro’s object container files embed the schema alongside the binary-encoded records). The binary encoding of the data itself is more space efficient, but requires the schema to be known in order to decode it.
Avro also specifies an RPC protocol that can be used to exchange schemas in a handshake between writer and reader before any data is sent.
Ergonomics
Avro has a lot of types, and fields are required by default. The enforced schema makes it more type safe, but the handshake and schema management add a lot of overhead for simple use cases. For large-scale systems and large data sets, however, it is a good choice.
Thrift
Thrift is a binary serialization library originally developed at Facebook and later open sourced through Apache.
It is also an RPC framework, meaning it can be used to define services and automatically generate client and server stubs in a variety of languages.
Types
Thrift supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Constants via `const`
- Structs via `struct`
- Unions via `union`
- Exceptions via `exception`
- Maps via `map`
- Sets via `set`
- Lists via `list`
- Required, optional, and default values
- UUID types
- C++ types via `cpp_type`
Encoding/Decoding
Thrift’s default binary protocol is similar to Protocol Buffers’ Tag-Length-Value (TLV) encoding. Newer versions of Thrift also offer a compact protocol that is more space efficient (it uses varint and zigzag encodings) but requires more CPU to read and write.
Schema Evolution
Thrift enables backwards compatibility by encoding the field number in the tag, and so follows the same schema evolution capabilities as Protocol Buffers and Cap’n Proto.
Schema Transport
Thrift uses `.thrift` files for its schema, which are then compiled into language-specific source files. This means the schema is not transported with the data; it has to be shared and compiled ahead of time.
Ergonomics
Thrift has three kinds of fields: required, optional, and default. Required fields must be present in the data and throw an error if they are missing; optional fields are not present unless set; default fields are present by default but can be overridden. Missing optional and default fields fall back to the default value.
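A minimal, hypothetical .thrift snippet showing the three kinds:

```thrift
struct Exam {
  1: required i32 score     // must be set, or (de)serialization fails
  2: optional i32 takenAt   // may be omitted entirely
  3: i32 attempts = 1       // default requiredness, with a default value
}
```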
Thrift also has a concept of exceptions, which are similar to errors or exceptions in programming languages. They are a way to communicate errors or other issues that may occur during a service call.
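A hypothetical sketch of how an exception is declared and attached to a service method:

```thrift
exception ExamNotFound {
  1: string reason
}

service ExamService {
  Exam getExam(1: i32 examId) throws (1: ExamNotFound notFound)
}
```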
Thrift’s design is great for RPC, but it can be overkill for simple use cases.