Data Serialization Formats
There are a lot of data serialization formats out there. Broadly, we can split them into two categories: simple serialization formats and schema-based serialization formats.
Simple Serialization Formats
Simple serialization formats only encode data into bytes. They are self-contained, meaning they include all the information needed to parse the data.
Some examples of simple serialization formats are JSON, MessagePack, CBOR, and BSON.
Pros:
- Simple to implement.
- Well-understood and widely supported.
- Often human-readable.
Cons:
- No schema, so backwards/forwards compatibility is left to the implementation, resulting in custom code or add-on specifications like JSON Schema.
- No validation of data.
- Typically slower to read, parse, and write.
- Typically larger in size.
The reasons for their trade-offs are easy to understand. Since there is no schema, the deserializer must read the entire message, so you can’t skip ahead, and you can’t assume the data is in any particular format. The code paths must account for all types, which makes them slower.
Since there is no schema, the data must include explicit information about types, lengths, and other metadata, which makes the data larger.
To make this clearer, let’s look at an example using JSON.
{ "date": "2024-11-17T21:04:00.000Z", "name": "John Doe", "age": 30, "isActive": true }
When you receive this data, you don’t know if the `date` field is a string, a number, a boolean, or null. You need to read the value to know, then transform it into the native string type of the language you are using.
If you only care about the `name` field, you still need to read the entire message. There is no way to skip ahead. Just because this message has a string `name` field in the second position doesn’t mean the next message will too. It could be in another position, contain a different type, or be missing altogether.
Finally, the data is larger because every key is encoded as a literal string in each message. If you have a large number of keys (or messages), this overhead adds up.
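To get a feel for this overhead, here is a minimal Python sketch comparing the JSON message above with a hand-rolled fixed binary layout (standing in for what a schema-based format would do; the layout itself is purely illustrative):

```python
import json
import struct

record = {
    "date": "2024-11-17T21:04:00.000Z",
    "name": "John Doe",
    "age": 30,
    "isActive": True,
}

# Self-describing: every key and every value is spelled out as text.
as_json = json.dumps(record).encode("utf-8")

# Schema-like: if both sides agree on the layout (an 8-byte Unix timestamp,
# a length-prefixed name, a 1-byte age, a 1-byte flag), no keys are needed.
name = record["name"].encode("utf-8")
as_packed = struct.pack(
    f"<qB{len(name)}sBB",
    1731877440,  # the date above as a Unix timestamp
    len(name), name,
    30,          # age
    1,           # isActive
)

print(len(as_json), len(as_packed))  # 85 vs 19 bytes for these values
```

The exact numbers don’t matter; the point is that the self-describing version repeats every key and spells out every value as text, while the fixed layout only works because both sides already agree on what each byte means.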
However, there are cases where simple serialization formats can be faster than schema-based ones. For example, if you are sending JSON to a JavaScript client, JSON has little overhead since the client can parse it natively (running optimized compiled code), whereas a schema-based format like Protocol Buffers has to be deserialized by JavaScript code (or WebAssembly), which is slower.
Schema-Based Serialization Formats
Schema-based serialization formats require a schema to be defined before any data is serialized. This means the schema is either known ahead of time, transported alongside the data, or exchanged between systems somehow.
Some examples of schema-based serialization formats are Protocol Buffers, Cap’n Proto, FlatBuffers, Avro, and Thrift.
Pros:
- Type safety.
- Data validation.
- Often smaller in size.
- Often faster to read, parse, and write.
Cons:
- More complex to implement.
- Requires a schema.
- Requires schema management.
Let’s look at an example using Cap’n Proto.
```capnp
struct Person {
  name @0 :Text;
  age @1 :Int32;
  isActive @2 :Bool;
}
```
When you receive the data, it will be in a binary format. The data is not human-readable, so you can’t just look at it and understand what it contains.
However, the data is type safe. If you know the schema, you know exactly what is in the message. You also know the order of the fields, the encoding used, and the size of each field, so you can skip ahead to the fields you care about. (Note: Not all schema-based formats support this.)
The data is often smaller in size because field names and other schema metadata are not included in the serialized data.
Next, we’ll look at each of these formats in more detail. For each format, we’ll look at its native types, how data is encoded and decoded, and how it handles schema evolution, schema transport, and ergonomics.
Protocol Buffers
Protocol Buffers, also known as protobuf, is a binary serialization format developed by Google. We’ll look at the version 3 specification, which is the latest at the time of this writing.
Types
Protocol Buffers supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Enum types
- Union types via `oneof`
- Nested types
- List types via `repeated`
- Map types
- Optional types via `optional`
- Dynamic types via `Any`
Encoding/Decoding
Protobuf uses a base-128 varint encoding for its integer types (the fixed-width and floating-point types take 4 or 8 bytes), so a varint takes 1 to 10 bytes, with smaller numbers taking up fewer bytes.
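As a sketch of how base-128 varints work (not the official implementation), each byte carries 7 bits of the value plus a continuation bit in the most significant position:

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint for a non-negative int: 7 bits per byte, MSB set while more follow."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (shift * 7)
        if not byte & 0x80:  # continuation bit cleared: last byte
            break
    return result

assert encode_varint(1) == b"\x01"        # small values fit in one byte
assert encode_varint(300) == b"\xac\x02"  # larger values grow a byte at a time
assert decode_varint(b"\xac\x02") == 300

# A field's tag is itself a varint of (field_number << 3) | wire_type, so
# field 1 with wire type 0 (varint) is the single byte 0x08, and a field
# `int32 a = 1;` set to 150 encodes as the three bytes 08 96 01.
assert encode_varint((1 << 3) | 0) == b"\x08"
assert encode_varint(150) == b"\x96\x01"
```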
The message structure follows a Tag-Length-Value (TLV) style encoding: each field is encoded with a tag (a varint holding the field number and wire type), a length (only for length-delimited values such as strings, bytes, and nested messages), and the value, encoded according to its type. Tags take 1 byte for field numbers up to 15 and up to 5 bytes for the largest field numbers; lengths are themselves varints.
Fields that are not set are simply not encoded, which saves space but means you cannot seek directly to a particular field; the message has to be scanned tag by tag.
The hard size limit for a message is 2 GiB, but in practice Google’s implementations default to a 64 MiB limit.
Schema Evolution
Protobuf enables backwards compatibility by encoding the field number in the tag, so:
- New fields can be added, and the previous version of the code will ignore them.
- Field types can be changed, as long as the new type is wire compatible (e.g. `int32`, `uint32`, `int64`, `uint64`, and `bool` are interchangeable).
- Field names can be changed, as long as the field number stays the same.
- Fields can be switched between implicit and explicit `optional`.
- Fields can be removed, but their field numbers must not be re-used.
- Field numbers (and names) can be marked `reserved` so they are never accidentally re-used.
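For example, when a field is removed, its number (and optionally its name) can be marked reserved so that a later edit cannot accidentally reuse it. A minimal, hypothetical sketch:

```proto
message Exam {
  reserved 2;           // field number of a removed field, never to be re-used
  reserved "taken_at";  // the old name can be reserved as well
  int32 score = 1;
}
```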
Schema Transport
Schema files are typically `.proto` files, which are then compiled into language-specific source files. This means the schema is not transported with the data; it has to be shared and compiled ahead of time.
Ergonomics
Protobuf has no `required` keyword. Instead, every field is optional: either implicitly optional (no `optional` keyword) or explicitly `optional`.
```proto
message Exam {
  int32 score = 1;              // Implicit optional
  optional int32 taken_at = 2;  // Explicit optional
}
```
If no value is provided for an implicit optional field, it will be set to the default value for that type.
If the `score` field is not present, it will be implicitly set to `0`, which is different from an actual score of `0`. Both values are encoded the same, so you can’t tell the difference.
To avoid this, you need to mark the `score` field as `optional`.
```proto
message Exam {
  optional int32 score = 1;     // Explicit optional
  optional int32 taken_at = 2;  // Explicit optional
}
```
Only then can you tell the difference between an actual score of `0` and a missing `score` field.
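In the generated code this difference shows up as field presence. A minimal Python sketch, assuming the explicit-optional Exam message above has been compiled with protoc into a (hypothetical) exam_pb2 module:

```python
# Assumes `protoc --python_out=. exam.proto` generated exam_pb2.py from the
# Exam message above, with `optional int32 score = 1;`.
from exam_pb2 import Exam

exam = Exam()
print(exam.score)              # 0 -- the default value
print(exam.HasField("score"))  # False -- the field was never set

exam.score = 0
print(exam.score)              # 0 -- same value as the default...
print(exam.HasField("score"))  # True -- ...but presence is now tracked
```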
Enums have the same issues, but they are handled differently depending on the language, which is even more confusing.
Lastly, protobuf has many numeric types (32-bit and 64-bit, signed and unsigned, varint and fixed integers and floats), each with their own performance characteristics.
Cap’n Proto
Cap’n Proto is a binary serialization library that was influenced by lessons learned from Protocol Buffers.
Version 1.0 was released in July 2023.
Types
Cap’n Proto supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Enum types
- Union types via `union`
- Nested types
- List types via `List(T)`
- Interface types via `interface`
- Generic types, e.g. `Data(T, Y)`
- Map types via generics, e.g. `Map(K, V)`
- Null type via `Void`
- Dynamic types via `AnyPointer`
Encoding/Decoding
Cap’n Proto’s binary encoding aligns data to 8-byte words for efficient memory access.
Each struct is split into a data section (fixed-size scalar fields) and a pointer section (offsets to variable-size objects such as text, lists, and nested structs), so:
- Data can be read/copied without knowing the schema.
- All pointers need to be read, but you can skip data for fields you don’t care about.
Because of this, Cap’n Proto is faster to read and write than Protocol Buffers, but less efficient in terms of space. If you are I/O bound, this may not be a good fit, though you can enable packing to reduce the size at the cost of extra CPU.
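The exact API differs per language; as a rough sketch using the third-party pycapnp bindings (assuming the Person schema above is saved as person.capnp; the method names are pycapnp’s and may differ elsewhere):

```python
import capnp  # pycapnp

# Assumes the `struct Person` schema shown earlier is saved as person.capnp.
schema = capnp.load("person.capnp")

person = schema.Person.new_message()
person.name = "John Doe"
person.age = 30
person.isActive = True

plain = person.to_bytes()          # word-aligned encoding: fast, but larger
packed = person.to_bytes_packed()  # packed encoding: smaller, but more CPU
print(len(plain), len(packed))
```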
Schema Evolution
Since Cap’n Proto adopts the same numeric field tags as Protocol Buffers, it supports the same schema evolution capabilities.
Schema Transport
Similar to Protocol Buffers, schema files are typically `.capnp` files, which are then compiled into language-specific source files. This means the schema is not transported with the data; it has to be shared and compiled ahead of time.
Ergonomics
Cap’n Proto has no `required` keyword and no `optional` keyword. Instead, fields are always optional and fall back to a default value if not present. To determine whether a field was set, you either give it an invalid default value and check for it, or, if you don’t mind the extra overhead, change the field to a union that includes a `Void` member.
This is more explicit than Protocol Buffers, but it takes more steps, and choosing an invalid default value can be hard and error-prone.
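A minimal sketch of that union-with-Void pattern, using a hypothetical Exam struct where “not taken” is distinguishable from a score of 0:

```capnp
struct Exam {
  score :union {
    notTaken @0 :Void;   # explicit "no value" case
    value    @1 :Int32;  # an actual score, including 0
  }
}
```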
FlatBuffers
FlatBuffers is a binary serialization library created by Google.
Types
FlatBuffers supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Enum types (integer values only) via `enum`
- Union types via `union`
- Nested types via `nested_flatbuffer`
- Array types
- Optional, required, and default values
Encoding/Decoding
FlatBuffers is similar to Cap’n Proto in that it uses an aligned binary encoding. The difference is that FlatBuffers aligns each field to the size of its scalar type rather than to 8-byte words, so it is more space efficient.
In addition, FlatBuffers uses a table-based encoding: each table carries a vtable of per-field offsets to the values, and fields left at their default value are simply omitted from the buffer. This means:
- You can skip reading fields you don’t care about.
- It is more space efficient, since default values take up no space in the buffer.
However, writing data is slower since the offsets need to be calculated, and sizes need to be known ahead of time.
Schema Evolution
FlatBuffers supports the same schema evolution capabilities as Protocol Buffers and Cap’n Proto. However, instead of explicit field numbers, FlatBuffers relies on field order: existing fields must keep their position (new fields are added at the end), and fields can only be deprecated, not removed.
Schema Transport
Similar to Protocol Buffers and Cap’n Proto, `.fbs` schema files are compiled into language-specific source files. This means the schema is not transported with the data; it has to be shared and compiled ahead of time.
Ergonomics
FlatBuffers has `required` fields, which must be present in the data. If they are not present, an error will be thrown during deserialization.
This is different from Protocol Buffers and Cap’n Proto, where fields are always optional, and default values are used if they are not present.
FlatBuffers’ design is great, but its documentation and language support are lacking.
Avro
Avro is a row-based serialization format developed by Apache.
Types
Avro supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Null type via `null`
- Enum types via `{"type": "enum", "symbols": ["X", "Y"]}`
- Array types via `{"type": "array", "items": "T"}`
- Nested types via `{"type": "record", "fields": [{"name": "x", "type": "T"}]}`
- Union types via `[X, Y]`
- Fixed-size types via `{"type": "fixed", "size": 16, "name": "MD5"}`
- Map types via `{"type": "map", "values": "V"}`
- Date, time, timestamp, duration and UUID types
Encoding/Decoding
Avro supports both a binary and a JSON encoding. The binary encoding uses variable-length zigzag encoding for `int` and `long` types, similar to Protocol Buffers’ varint encoding, so like Protocol Buffers it is space efficient but slower to read and write.
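A sketch of the zigzag mapping for 64-bit values (not Avro’s implementation): small numbers of either sign map to small unsigned numbers, so they stay short after variable-length encoding:

```python
def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit int to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-1000, 1000))
```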
In addition, Avro defines a sort order for data, meaning that comparisons can be made directly on the encoded bytes without deserializing them, allowing for efficient sorting and indexing.
Schema Evolution
Avro supports schema evolution through schema resolution: the reader resolves differences between the schema the data was written with and the schema it wants to read with.
In addition, Avro requires the reader to have the exact schema that was used to write the data. This is different from Protocol Buffers and Cap’n Proto, where the reader can decode data with only its own schema. It means new schemas must be made available to readers before writers start using them, and old schemas must be kept around to read old data.
At a small scale, Git could be used to distribute schemas; at a larger scale, a schema registry is needed.
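As a rough sketch of schema resolution using the third-party fastavro library (the schemas are hypothetical: a writer on version 1 of a record and a reader on version 2, which added a field with a default):

```python
import io
import fastavro

# Writer's schema: version 1 of a hypothetical record.
writer_schema = {
    "type": "record", "name": "Person",
    "fields": [{"name": "name", "type": "string"}],
}

# Reader's schema: version 2 adds an `age` field with a default,
# so data written with version 1 can still be resolved against it.
reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int", "default": -1},
    ],
}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"name": "John Doe"})
buf.seek(0)

# The reader must know the writer's schema to decode, then resolves to its own.
record = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(record)  # {'name': 'John Doe', 'age': -1}
```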
Schema Transport
Avro schemas are JSON documents, which are human-readable and can be transported with the data (Avro’s object container files embed the schema alongside the binary-encoded records). The binary encoding of the data itself is more space efficient, but requires the schema to be known in order to decode it.
Avro also specifies an RPC protocol that can be used to exchange schemas in a handshake between writer and reader before any data is sent.
Ergonomics
Avro has a lot of types, and fields are required by default. The enforced schema makes it more type safe, but the handshake and schema management add a lot of overhead for simple use cases. For large-scale systems and large data sets, however, it is a good choice.
Thrift
Thrift is a binary serialization library originally developed at Facebook and later open sourced through Apache.
It is also an RPC framework, meaning it can be used to define services and automatically generate client and server stubs in a variety of languages.
Types
Thrift supports the following types:
- Scalar types (numbers, booleans, strings, and bytes)
- Constants via `const`
- Structs via `struct`
- Unions via `union`
- Exceptions via `exception`
- Maps via `map`
- Sets via `set`
- Lists via `list`
- Required, optional, and default values
- UUID types
- C++ types via `cpp_type`
Encoding/Decoding
Thrift’s default binary protocol is similar to Protocol Buffers’ Tag-Length-Value (TLV) encoding. Newer versions of Thrift also offer a compact protocol that is more space efficient (it uses varint and zigzag encodings) but requires more CPU to read and write.
Schema Evolution
Thrift enables backwards compatibility by encoding the field number in the tag, and so follows the same schema evolution capabilities as Protocol Buffers and Cap’n Proto.
Schema Transport
Thrift uses `.thrift` files for its schema, which are then compiled into language-specific source files. This means the schema is not transported with the data; it has to be shared and compiled ahead of time.
Ergonomics
Thrift has three kinds of fields: required, optional, and default. Required fields must be present in the data and throw an error if they are missing; optional fields are not present unless set; default fields are present by default but can be overridden. Missing optional and default fields fall back to the default value.
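A minimal, hypothetical .thrift snippet showing the three kinds:

```thrift
struct Exam {
  1: required i32 score     // must be set, or (de)serialization fails
  2: optional i32 takenAt   // may be omitted entirely
  3: i32 attempts = 1       // default requiredness, with a default value
}
```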
Thrift also has a concept of exceptions, which are similar to errors or exceptions in programming languages. They are a way to communicate errors or other issues that may occur during a service call.
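A hypothetical sketch of how an exception is declared and attached to a service method:

```thrift
exception ExamNotFound {
  1: string reason
}

service ExamService {
  Exam getExam(1: i32 examId) throws (1: ExamNotFound notFound)
}
```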
Thrift’s design is great for RPC, but it can be overkill for simple use cases.