Encoding and Evolution¶
In the previous chapter, we discussed how databases handle changes to the stored data model over time. This chapter focuses on how such changes are handled in application code. In large applications, changes cannot take effect instantly, because clients may still rely on an older data model that conflicts with newer changes. Server-side applications can deploy updates gradually using techniques like rolling upgrades to minimize downtime, but client-side applications depend on users installing updates, so old and new code inevitably coexist. For the system to keep functioning smoothly, it must be both forward-compatible (1) and backward-compatible (2).
- Old code can read data written by new code
- New code can read data written by old code
Backward compatibility is easier to achieve because newer code can explicitly handle the known cases of older data. Forward compatibility is harder, as it requires older code to safely ignore additions introduced by newer code. Different communication protocols address this problem in their own ways, using encoding formats such as JSON, Protobuf, Thrift, and Avro.
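As a minimal sketch (using JSON and a made-up `nickname` field), both properties come down to how a decoder treats fields it does or does not expect:

```python
import json

# Forward compatibility: old code reads data written by new code.
# A newer writer added a "nickname" field; the old reader simply
# never looks at keys it does not know about.
new_record = json.loads('{"name": "John", "nickname": "Johnny"}')
name = new_record["name"]  # old reader touches only the fields it knows

# Backward compatibility: new code reads data written by old code.
# The old writer never wrote "nickname", so the new reader supplies a default.
old_record = json.loads('{"name": "John"}')
nickname = old_record.get("nickname", "unknown")

print(name, nickname)  # John unknown
```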
Format for Encoding Data¶
Applications typically work with two different data formats:
- In-memory data structures optimized for CPU operations (such as lists, hash maps, etc.)
- Network or storage formats, which represent data as a sequence of bytes
It is the program’s responsibility to translate between these formats. Converting in-memory data to bytes is called encoding (also serialization or marshalling), while the reverse process is called decoding (deserialization or unmarshalling). This need has led to the development of many encoding formats and libraries, each optimized for specific use cases.
Language-Specific Formats¶
Most programming languages conveniently provide built-in libraries for encoding and decoding (such as Java Serialization and Python pickle). However, using them introduces several problems:
- The system becomes tightly coupled to a specific programming language, along with any other system that needs to integrate with it.
- Decoding often involves instantiating arbitrary classes, which can create security risks if an attacker is able to trigger remote code execution.
- There is usually no built-in support for forward or backward compatibility.
- These libraries often have poor efficiency and performance.
Because of these issues, it is generally recommended to use such libraries only for temporary purposes, such as storing intermediate results locally.
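For example, Python's pickle is convenient for scratch storage, but the bytes it produces are meaningful only to Python (the cached dict here is made up for illustration):

```python
import pickle

cache = {"user_id": 42, "scores": [3.5, 4.0]}

blob = pickle.dumps(cache)           # Python-specific binary format
assert pickle.loads(blob) == cache   # fine for a temporary local file

# Caveat: never unpickle untrusted bytes; pickle.loads can be made to
# execute arbitrary code during deserialization.
```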
JSON, XML, and Binary Variants¶
Using a standard format that can be implemented across different environments is more beneficial than relying on custom formats. This led to the development of widely adopted formats such as JSON, XML, and CSV. These are text-based formats (1), which makes them partially human-readable. JSON is especially popular due to its built-in support in web browsers.
- Data is encoded by translating individual characters into bytes using character encodings such as UTF-8.
Issues with text-based formats:
- Ambiguous number encoding: XML and CSV cannot distinguish between numbers and digit-only strings without an external schema. JSON distinguishes numbers from strings but does not differentiate between integers and floating-point numbers.
  Integers \(\gt 2^{53}\) are not exact in languages like JavaScript: JavaScript represents all numbers (and Python its floats) as IEEE 754 double-precision values. These encode numbers in the form \((1.\text{mantissa}) \times 2^{\text{exponent}}\) using 64 bits: 1 sign bit, 11 exponent bits, and 52 mantissa bits. Because the leading 1 is implicit, doubles can exactly represent integers only up to \(2^{53}\). Beyond this limit, not all integers are representable, so values get rounded: \(2^{53} + 1\) rounds down to \(2^{53}\), while \(2^{53} + 2\) is representable. As values grow, the gap between representable numbers widens. A common workaround is to encode large integers as decimal strings and parse them explicitly.
- No native support for binary data: binary values must be encoded as text (e.g., using Base64), which increases data size by about 33% and typically requires a schema to interpret correctly.
- Limited schema support: XML and JSON offer optional schema mechanisms, but they are complex and rarely used in practice (XML schemas are more common than JSON schemas). CSV provides no schema support at all.
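Both the number-precision and Base64-overhead issues can be observed directly from Python's standard library (Python floats are IEEE 754 doubles):

```python
import base64

n = 2 ** 53
print(float(n + 1) == float(n))  # True: 2^53 + 1 rounds down to 2^53
print(float(n + 2) == n + 2)     # True: 2^53 + 2 is exactly representable

# Common workaround: transmit large integers as decimal strings.
big_id = str(2 ** 60)

# Base64 turns every 3 bytes of binary data into 4 text characters (~33% larger).
blob = bytes(300)
print(len(base64.b64encode(blob)))  # 400
```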
Despite these limitations, JSON, XML, and CSV are sufficient for many use cases. They are likely to remain popular, especially for data exchange between organizations, where the difficulty of getting different organizations to agree on anything outweighs most other concerns.
Binary encoding
Binary formats use significantly less space and offer faster encoding and decoding than text-based formats. This has led to the development of binary extensions of JSON and XML, such as MessagePack, BSON, and WBXML. In addition to supporting more data types, these extensions retain the same data model as their original formats. For example, MessagePack (a binary variant of JSON) still encodes field names as strings rather than relying on a schema. You can compare the following byte representations of the same data to better understand the differences between these encoding formats.
JSON Byte representation
=> JSON, encoded using UTF-8 and represented in hexadecimal. Before encoding, the data is cleaned and minified to remove unnecessary characters such as spaces, newlines, and tabs. This results in a size of 51 bytes.
┌────┐
│ 7b │ '{'
└────┘
┌──────────────────────┐ ┌────┐ ┌──────────────────────┐ ┌────┐
│ 22 6e 61 6d 65 22 │ │ 3a │ │ 22 4a 6f 68 6e 22 │ │ 2c │
└──────────────────────┘ └────┘ └──────────────────────┘ └────┘
"name" ':' "John" ','
┌──────────────────┐ ┌────┐ ┌──────────┐ ┌────┐
│ 22 61 67 65 22 │ │ 3a │ │ 32 33 │ │ 2c │
└──────────────────┘ └────┘ └──────────┘ └────┘
"age" ':' "23" ','
┌──────────────────────────┐ ┌────┐ ┌────┐
│ 22 73 6b 69 6c 6c 73 22 │ │ 3a │ │ 5b │
└──────────────────────────┘ └────┘ └────┘
"skills" ':' '['
┌──────────────────────┐ ┌────┐
│ 22 63 6f 64 65 22 │ │ 2c │
└──────────────────────┘ └────┘
"code" ','
┌──────────────────────────┐ ┌────┐
│ 22 64 65 73 69 67 6e 22 │ │ 5d │
└──────────────────────────┘ └────┘
"design" ']'
┌────┐
│ 7d │
└────┘
'}'
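The 51-byte figure can be reproduced with Python's json module; `separators=(",", ":")` performs the minification (no spaces or newlines):

```python
import json

doc = {"name": "John", "age": 23, "skills": ["code", "design"]}
encoded = json.dumps(doc, separators=(",", ":")).encode("utf-8")

print(len(encoded))       # 51 bytes, matching the diagram
print(encoded[:6].hex())  # 7b226e616d65 -> '{' '"' 'n' 'a' 'm' 'e'
```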
MessagePack Byte representation
=> MessagePack encodes data using a [type + length] [raw bytes] structure and takes 36 bytes.
┌────┐
│ 83 │
└────┘
object (3 entries)
┌────┐ ┌─────────────┐ ┌────┐ ┌─────────────┐
│ a4 │ │ 6e 61 6d 65 │ │ a4 │ │ 4a 6f 68 6e │
└────┘ └─────────────┘ └────┘ └─────────────┘
string "name" string "John"
 (length 4)            (length 4)
┌────┐ ┌──────────┐ ┌────┐
│ a3 │ │ 61 67 65 │ │ 17 │
└────┘ └──────────┘ └────┘
string(3)  "age"   positive fixint 23
┌────┐ ┌───────────────────┐ ┌────┐
│ a6 │ │ 73 6b 69 6c 6c 73 │ │ 92 │
└────┘ └───────────────────┘ └────┘
string(6) "skills" array(2)
┌────┐ ┌──────────────┐
│ a4 │ │ 63 6f 64 65 │
└────┘ └──────────────┘
string(4) "code"
┌────┐ ┌───────────────────┐
│ a6 │ │ 64 65 73 69 67 6e │
└────┘ └───────────────────┘
string(6) "design"
From this comparison, MessagePack's space saving over JSON (36 bytes vs. 51) is relatively small, which may not justify sacrificing human readability in many cases.
Thrift and Protocol Buffers¶
MessagePack cannot save much space because it still needs to encode field names as strings within the message. This overhead could be avoided by using a schema, as done by Thrift and Protocol Buffers.
Both Thrift and Protocol Buffers (Protobuf) require a schema to translate data. This schema defines the structure and field identifiers used during encoding. Each provides code-generation tools that use the schema to generate classes, which can then be used directly for serialization and deserialization.
Thrift =>
struct Person {
  1: required string name,
  2: optional i64 age,
  3: optional list<string> skills
}
Protobuf =>
message Person {
  required string name = 1;
  optional int32 age = 2;
  repeated string skills = 3;
}
Thrift supports multiple encoding formats, two of the most commonly used being BinaryProtocol and CompactProtocol.
BinaryProtocol encodes each field as [field_type: 1 byte][field_id: int16][value…] and terminates the struct with a stop byte (0x00). Unlike MessagePack, field names are replaced with the numeric field tags defined in the schema, allowing BinaryProtocol to encode the data in 49 bytes.
Thrift-BinaryProtocol
┌────┐ ┌───────┐ ┌─────────────┐ ┌─────────────┐
│ 0b │ │ 00 01 │ │ 00 00 00 04 │ │ 4a 6f 68 6e │
└────┘ └───────┘ └─────────────┘ └─────────────┘
type 11 field tag length 4 "John"
(string) 1
┌────┐ ┌───────┐ ┌─────────────────────────┐
│ 0a │ │ 00 02 │ │ 00 00 00 00 00 00 00 17 │
└────┘ └───────┘ └─────────────────────────┘
type 10 field tag 23
(i64) 2
┌────┐ ┌───────┐ ┌────┐ ┌────────────┐
│ 0f │ │ 00 03 │ │ 0b │ │ 00 00 00 02│
└────┘ └───────┘ └────┘ └────────────┘
type 15 field tag item type list size 2
(list) 3 11 (string)
┌─────────────┐ ┌─────────────┐
│ 00 00 00 04 │ │ 63 6f 64 65 │
└─────────────┘ └─────────────┘
length 4 "code"
┌─────────────┐ ┌───────────────────┐ ┌────┐
│ 00 00 00 06 │ │ 64 65 73 69 67 6e │ │ 00 │ end of struct
└─────────────┘ └───────────────────┘ └────┘
length 6 "design"
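The 49-byte total can be confirmed by assembling the same layout with `struct` (a sketch based on the diagram above, not the Thrift library itself):

```python
import struct

def tstring(s: bytes) -> bytes:
    # BinaryProtocol strings: int32 big-endian length followed by the raw bytes.
    return struct.pack(">i", len(s)) + s

msg = (
    b"\x0b" + struct.pack(">h", 1) + tstring(b"John")         # field 1, type 11 (string)
    + b"\x0a" + struct.pack(">h", 2) + struct.pack(">q", 23)  # field 2, type 10 (i64)
    + b"\x0f" + struct.pack(">h", 3)                          # field 3, type 15 (list)
    + b"\x0b" + struct.pack(">i", 2)                          # element type string, 2 items
    + tstring(b"code") + tstring(b"design")
    + b"\x00"                                                 # end of struct
)

print(len(msg))  # 49
```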
CompactProtocol further reduces size by encoding the field type and tag into a single byte and by using variable-length integers for numeric values (1). Smaller numbers can be stored in a single byte, while larger numbers use additional bytes. Using this approach, CompactProtocol encodes the same data in 23 bytes.
- The most significant bit of each byte indicates whether additional bytes are needed to represent the number.
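A minimal sketch of the two tricks combined: zigzag mapping (so small negative numbers also stay small) followed by varint packing with a continuation bit:

```python
def zigzag(n: int) -> int:
    # Interleave signed values: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4.
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    # 7 payload bits per byte; a set MSB means "more bytes follow".
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

print(varint(zigzag(23)).hex())    # 2e -- a single byte suffices
print(varint(zigzag(-300)).hex())  # d704 -- larger magnitudes take more bytes
```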
Thrift-CompactProtocol
┌╌╌╌╌╌╌╌╌╌┬╌╌╌╌╌╌╌╌╌┐ ┌────┐ ┌────┐ ┌─────────────┐
╎ 0 0 0 1 ╎ 1 0 0 0 ╎ -> │ 18 │ │ 04 │ │ 4a 6f 68 6e │
└╌╌╌╌╌╌╌╌╌┴╌╌╌╌╌╌╌╌╌┘ └────┘ └────┘ └─────────────┘
field type 8 tag+type length 4 "John"
tag = 1 (string)
┌╌╌╌╌╌╌╌╌╌┬╌╌╌╌╌╌╌╌╌┐ ┌────┐ ┌────┐
╎ 0 0 0 1 ╎ 0 1 1 0 ╎ -> │ 16 │ │ 2e │
└╌╌╌╌╌╌╌╌╌┴╌╌╌╌╌╌╌╌╌┘ └────┘ └────┘
field type 6 tag+type zigzag-varint(23)
tag += 1 (i64)
┌╌╌╌╌╌╌╌╌╌┬╌╌╌╌╌╌╌╌╌┐ ┌────┐ ┌────┐ ┌╌╌╌╌╌╌╌╌╌┬╌╌╌╌╌╌╌╌╌┐
╎ 0 0 0 1 ╎ 1 0 0 1 ╎ -> │ 19 │ │ 28 │ <--- ╎ 0 0 1 0 ╎ 1 0 0 0 ╎
└╌╌╌╌╌╌╌╌╌┴╌╌╌╌╌╌╌╌╌┘ └────┘ └────┘ └╌╌╌╌╌╌╌╌╌┴╌╌╌╌╌╌╌╌╌┘
field type 9 tag+type 2 list item type 8
tag += 1 (list) items (string)
┌────┐ ┌─────────────┐
│ 04 │ │ 63 6f 64 65 │
└────┘ └─────────────┘
length 4 "code"
┌────┐ ┌───────────────────┐ ┌────┐
│ 06 │ │ 64 65 73 69 67 6e │ │ 00 │ end of struct
└────┘ └───────────────────┘ └────┘
length 6 "design"
Protobuf uses a similar approach, encoding each field as [field key][value], where field key = (field_number << 3) | wire_type. Here, field_number is the field identifier from the schema, and wire_type represents the data type. Unlike Thrift, Protobuf does not use an explicit end-of-struct marker, does not require fields to be ordered, and represents repeated fields as repeated key–value pairs. With no trailing stop byte, Protobuf encodes the same message in 22 bytes.
Protobuf
┌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌╌╌╌╌┐ ┌────┐ ┌────┐ ┌─────────────┐
╎ 0 0 0 0 1 ╎ 0 1 0 ╎ ->│ 0a │ │ 04 │ │ 4a 6f 68 6e │
└╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌╌╌╌╌┘ └────┘ └────┘ └─────────────┘
field 1 wire type 2 key length 4 "John"
(string)
┌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌╌╌╌╌┐ ┌────┐ ┌────┐
╎ 0 0 0 1 0 ╎ 0 0 0 ╎ -> │ 10 │ │ 17 │
└╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌╌╌╌╌┘ └────┘ └────┘
field 2 wire type 0 key varint(23)
(varint)
┌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌╌╌╌╌┐ ┌────┐ ┌────┐ ┌─────────────┐
╎ 0 0 0 1 1 ╎ 0 1 0 ╎ -> │ 1a │ │ 04 │ │ 63 6f 64 65 │
└╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌╌╌╌╌┘ └────┘ └────┘ └─────────────┘
field 3 wire type 2 key length 4 "code"
(string)
┌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌╌╌╌╌┐ ┌────┐ ┌────┐ ┌───────────────────┐
╎ 0 0 0 1 1 ╎ 0 1 0 ╎ -> │ 1a │ │ 06 │ │ 64 65 73 69 67 6e │
└╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌╌╌╌╌┘ └────┘ └────┘ └───────────────────┘
field 3 wire type 2 key length 6 "design"
(string)
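The key bytes in the diagram follow directly from (field_number << 3) | wire_type, and the whole message can be rebuilt by hand (a sketch for illustration, not the protobuf library):

```python
def key(field_number: int, wire_type: int) -> bytes:
    # Low 3 bits: wire type; remaining bits: field number. (Keys are varints,
    # but one byte suffices for field numbers up to 15.)
    return bytes([(field_number << 3) | wire_type])

def field_str(field_number: int, s: bytes) -> bytes:
    # Wire type 2: length-delimited value.
    return key(field_number, 2) + bytes([len(s)]) + s

msg = (
    field_str(1, b"John")       # 0a 04 "John"
    + key(2, 0) + bytes([23])   # 10 17 -- wire type 0: varint
    + field_str(3, b"code")     # 1a 04 "code"   (repeated field: the key
    + field_str(3, b"design")   # 1a 06 "design"  is repeated per item)
)

print(msg.hex())  # 0a044a6f686e10171a04636f64651a0664657369676e
print(len(msg))   # 22
```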
How well do these formats adapt to evolution?
| Change | Forward Compatibility (old readers, new writers) | Backward Compatibility (new readers, old writers) |
|---|---|---|
| Adding a new field | Old readers safely ignore the unknown tag | New readers treat the missing field as unset/default (new fields must be optional or have defaults) |
| Removing a field | Safe only if the field was optional; old readers see it as absent | New readers ignore the removed field's tag (the tag number must never be reused) |
| Renaming a field | Safe: field names are not encoded on the wire | Safe: field tags remain unchanged |
| Changing a datatype | Risky: values may be truncated or decoded incorrectly | Risky: old data may be truncated or decoded incorrectly |
| Changing optional to required | Safe for old readers, since new writers always include the field | Unsafe: old writers may have omitted the now-required field |
Avro¶
Avro is also a binary encoding format, but it takes a different approach from Thrift (1) and Protocol Buffers. Like them, it uses a schema to define the encoding structure, but it supports two schema representations: Avro IDL (2) for human editing and a JSON-based format for machine use.
- Avro originated as a subproject of Hadoop, where Thrift was found to be insufficient.
- Interface Definition Language.
Avro IDL =>
record Person {
string name;
union { null, long } age = null;
array<string> skills;
}
JSON equivalent =>
{
"type": "record",
"name": "Person",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": ["null", "long"], "default": null},
{"name": "skills", "type": {"type": "array", "items": "string"}}
]
}
Unlike Thrift and Protobuf, Avro does not associate tags with fields, nor does it encode field names in the serialized data. When encoding the same example, Avro requires only 21 bytes. The encoded data is simply a concatenation of values (lengths and integers are written as zigzag-encoded varints), with no embedded field IDs or data types. As a result, decoding requires reading the data in the exact order defined by the schema, which specifies both field order and types.
Avro
┌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌┐    ┌────┐ ┌─────────────┐
╎ 0 0 0 0 1 0 0 ╎ 0 ╎ -> │ 08 │ │ 4a 6f 68 6e │
└╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌┘    └────┘ └─────────────┘
  length 4      sign              "John"
┌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌┐    ┌────┐ ┌────┐
╎ 0 0 0 0 0 0 1 ╎ 0 ╎ -> │ 02 │ │ 2e │
└╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌┘    └────┘ └────┘
  union branch 1 (long, not null)   zigzag-varint(23)
┌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌┐    ┌────┐
╎ 0 0 0 0 0 1 0 ╎ 0 ╎ -> │ 04 │
└╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌┘    └────┘
  block count 2
┌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌┐    ┌────┐ ┌─────────────┐
╎ 0 0 0 0 1 0 0 ╎ 0 ╎ -> │ 08 │ │ 63 6f 64 65 │
└╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌┘    └────┘ └─────────────┘
  length 4                        "code"
┌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┬╌╌╌┐    ┌────┐ ┌───────────────────┐ ┌────┐
╎ 0 0 0 0 1 1 0 ╎ 0 ╎ -> │ 0c │ │ 64 65 73 69 67 6e │ │ 00 │
└╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┴╌╌╌┘    └────┘ └───────────────────┘ └────┘
  length 6                        "design"              end of array
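Since the encoding is just concatenated zigzag varints and raw bytes, it can be rebuilt by hand to check the total (a sketch for illustration, not the avro library):

```python
def zigzag(n: int) -> int:
    # Interleave signed values: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4.
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    # 7 payload bits per byte; a set MSB means "more bytes follow".
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def avro_str(s: bytes) -> bytes:
    # A string is a zigzag-varint length followed by its UTF-8 bytes.
    return varint(zigzag(len(s))) + s

msg = (
    avro_str(b"John")                         # field "name"
    + varint(zigzag(1)) + varint(zigzag(23))  # union branch 1 (long), value 23
    + varint(zigzag(2))                       # array block with 2 items
    + avro_str(b"code") + avro_str(b"design")
    + b"\x00"                                 # empty block terminates the array
)

print(msg.hex())  # 084a6f686e022e0408636f64650c64657369676e00
print(len(msg))   # 21
```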
How, then, does Avro remain compatible with schema changes?
Avro uses separate writer schemas (1) and reader schemas (2) during encoding and decoding. These schemas do not need to be identical, but they must be compatible. The Avro library resolves differences between the writer and reader schemas automatically. For example, fields are reordered as needed, fields present in the writer but missing in the reader are ignored, and fields present in the reader but absent in the writer are filled with default values defined in the reader schema (see the documentation for full details).
- The schema available to the application, typically compiled into the app.
- The expected schema of incoming data.
To maintain forward compatibility (new writers, old readers) and backward compatibility (new readers, old writers):
- Fields may be added or removed only if default values are provided.
- Datatypes may be changed, provided Avro can perform the type conversion.
- Fields may be renamed using aliases in the reader schema; this preserves backward compatibility but not forward compatibility, since old readers do not know the new name.
How does the reader obtain the writer’s schema for comparison? This depends on the context:
- When storing many records in a large file (e.g., Hadoop), the writer schema is stored once at the beginning of the file using Avro object container files.
- When storing records in a database, where schemas may evolve over time, the writer schema version can be stored with each record, while all historical schema versions are maintained separately.
- When transmitting records over the network, the writer and reader can negotiate schemas during connection setup (Avro RPC).
Dynamically Generated Schemas
A key reason Avro does not use field IDs is to support dynamically generated schemas. For example, when exporting a database to a binary format, an Avro schema can be generated automatically from table column names. The data can then be encoded using this schema and stored in an Avro object container file that includes the writer schema. Changes to the database schema can be handled automatically, without manual schema updates.
Code Generation in Dynamically Typed Languages
Schema-based code generation is especially valuable in statically typed languages, where it enables compile-time type checking. In dynamically typed languages, however, code generation often adds an extra compilation step without providing the same benefits. With Avro object container files, applications can work directly with the data while the underlying library manages the schema, abstracting these details away from the developer.
The Merits of Schemas¶
- Encoded data consumes significantly less space compared to non-schema formats such as MessagePack.
- Schemas serve as living documentation backed by real application usage, making it easier to validate forward and backward compatibility by comparing new changes against previously defined schemas.
- In statically typed languages, code generation enables compile-time type checking and more efficient data structures.
Modes of Dataflow¶
Forward and backward compatibility are properties of the processes that encode and decode data shared through files or networks. Below are some of the most common dataflow scenarios.
Dataflow Through Databases¶
The process that writes to a database encodes the data, while the process that reads from it decodes the data. Since new code must always be able to read existing data, backward compatibility is essential. In addition, databases often have multiple readers running different versions of code simultaneously, so an old reader may encounter data written by newer code, requiring forward compatibility as well.
A subtle edge case is preserving unknown fields: if an older process reads a record, updates it, and writes it back, fields added by newer code can be lost unless the application or encoding library explicitly preserves fields it does not recognize.
Databases allow any value to be updated at any time, so a database may contain rows written under many historical schema versions. Rewriting all existing data to a new schema (a migration) is expensive, so most DBMSs instead support lightweight schema evolution, such as adding a new column whose default is null; older rows simply return null for that column when read.
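For example, SQLite (via Python's sqlite3 module) adds a nullable column without rewriting existing rows; older rows just read back null for the new column (the table and column names here are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (name TEXT)")
db.execute("INSERT INTO person VALUES ('John')")

# Schema evolution: the new column defaults to NULL, so the existing
# row does not need to be rewritten.
db.execute("ALTER TABLE person ADD COLUMN age INTEGER DEFAULT NULL")

print(db.execute("SELECT name, age FROM person").fetchone())  # ('John', None)
```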
Databases also require regular backups for security, archival, or analytical purposes. In these cases, data is typically dumped in the current format, even if it contains records written using older schemas. Formats such as Avro object container files are well suited here, as they store data together with its schema.
Dataflow Through Services: REST and RPC¶
When processes communicate over a network, the most common approach is a client–server architecture. Server processes expose APIs (or services) that clients use to make requests. On the web, clients are usually browsers that rely on standardized protocols (HTTP, TLS) and formats (HTML, URLs, JSON). Applications may also define custom protocols using HTTP as the transport layer.
In a microservice architecture, servers can themselves act as clients to other services. The primary goal is to split large applications into smaller, independently deployable services, which requires APIs to remain compatible across versions.
Two common approaches to building web services are REST and SOAP:
- REST is a design philosophy built on HTTP, using URLs to identify resources and simple data formats along with standard HTTP features.
- SOAP is an XML-based protocol for network communication. Although commonly used over HTTP, it defines its own extensive set of standards through web service frameworks.
SOAP APIs are described using the Web Services Description Language (WSDL), which enables code generation. However, WSDL is not human-readable, and SOAP APIs tend to rely heavily on vendor-specific tools, making interoperability harder for unsupported languages. RESTful APIs, by contrast, are simpler and typically require less code generation.
Both REST and SOAP originated from the idea of Remote Procedure Calls (RPC), which aims to make remote interactions look like local function calls. This model is fundamentally flawed for several reasons (1), yet modern RPC frameworks continue to evolve: gRPC builds on Protobuf, and Thrift and Avro ship their own RPC support. These newer frameworks make network boundaries explicit, for example by using futures (promises) to represent calls that may fail. As a result, RPC is most commonly used between services owned by the same organization, while REST is more often used for public APIs.
- Issues such as unpredictability, timeouts, network failures, and semantic mismatches.
Evolvability is critical so that clients and servers can be deployed independently. In practice, servers are usually updated before clients, which requires forward compatibility for responses and backward compatibility for requests. In RPC systems, this behavior is largely determined by the encoding format used for requests and responses.