
Data Models and Query Languages

The design of a data model has a profound effect on software development, as well as on the way the underlying problem is thought about. Applications are built by layering one data model on top of another, and for each layer the key question is: "How should its data be represented in terms of the next lower layer?" For example,

flowchart LR
    A[Real World<br/> people, products]
    B[Application Model<br/> objects, APIs]
    C[Logical Data Model<br/> JSON, Tables]
    D[Physical Data Representation<br/> bytes in memory, disk, network]
    E[Hardware Level<br/> electrical signals, light, magnetism]

    A --> B
    B --> C
    C --> D
    D --> E

In this approach, each layer hides the complexity of the layers beneath it by exposing a clean data model. In complex applications, this can result in multiple layers of APIs built on top of other APIs to manage and conceal complexity. These abstractions allow people from different domains (like storage engineers and application developers) to work independently at their respective layers with minimal conflict between each other.

In this chapter, we examine different data models used for data storage and querying, along with their assumptions about usage. These assumptions make certain operations easy, fast, and optimal, while trading off performance or support for other operations.

Relational Model vs Document Model

A Brief History of the Relational Model

The relational model was originally proposed as a theoretical approach to simplify data processing in business applications such as transaction processing (1) and batch processing (2). It was later implemented by multiple relational database management systems (RDBMSs) and standardized through SQL, which led it to become the dominant data model for much of computing history.

  1. Examples include sales systems, banking transactions, and warehouse stock-keeping
  2. Examples include customer invoicing, payroll processing, and reporting

The relational model organizes data into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL). Its popularity stems from its ability to hide the internal representation of data from application developers, while generalizing well as computers became more powerful and networked.

The NoSQL ("Not Only SQL")(1) data model emerged around the 2010s to address new challenges such as scalability across large datasets, high write throughput, support for specialized query patterns, and reduced schema rigidity compared to the relational model.

  1. The term originated largely as a marketing hashtag popularized through social media

The Object-Relational Mismatch

Application development is mostly done using object-oriented programming languages, while SQL databases store data in relational tables. This difference leads to an awkward translation layer between the two models. This disconnect is commonly referred to as impedance mismatch. Object-Relational Mappers (ORMs), such as Hibernate, attempt to reduce the boilerplate code required for this translation, but they cannot completely hide the differences. For example, mapping a single resume object into a relational model often results in multiple tables and joins as follows.

flowchart LR
    subgraph OO[Object-Oriented Model]
        R[Resume Object]
        R --> N[Name : String]
        R --> E[Experiences : List]
        R --> S[Skills : List]
        R --> ED[Education : Object]

        E --> E1[Experience<br/>company, role, dates]
        E --> E2[Experience<br/>company, role, dates]
    end

    subgraph RDB[Relational Database Model]
        T1[Resume Table<br/>resume_id, name]
        T2[Experience Table<br/>exp_id, resume_id, company, role, dates]
        T3[Skill Table<br/>skill_id, resume_id, skill]
        T4[Education Table<br/>edu_id, resume_id, degree, school]
    end

    R ---|Mapped via ORM| T1
    E1 --- T2
    E2 --- T2
    S --- T3
    ED --- T4

The data structure of a resume is a mostly self-contained document, which can be represented more naturally as a JSON-like structure than as a multi-table relational schema. This observation led to the development of DBMSs such as MongoDB and CouchDB, which support JSON-like storage models (also known as the document model). The JSON storage model also provides much better locality, since all relevant information can be fetched in a single query, rather than relying on JOINs across one-to-many mappings as in the relational model.
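To make the contrast concrete, here is a minimal sketch of the same resume in both shapes. The field names and values are illustrative, not taken from any real system: on the document side everything lives in one nested structure, while the relational side scatters it across tables linked by foreign keys.

```python
import json

# Document model: the whole resume is one self-contained tree,
# fetched in a single read.
resume_doc = {
    "resume_id": 1,
    "name": "Lucy",
    "experiences": [
        {"company": "Acme", "role": "Engineer", "dates": "2019-2022"},
        {"company": "Initech", "role": "Senior Engineer", "dates": "2022-"},
    ],
    "skills": ["Python", "SQL"],
}

# Relational model: the same data split into rows across tables,
# linked by resume_id and reassembled with joins at query time.
resume_rows = [{"resume_id": 1, "name": "Lucy"}]
experience_rows = [
    {"exp_id": 10, "resume_id": 1, "company": "Acme", "role": "Engineer"},
    {"exp_id": 11, "resume_id": 1, "company": "Initech", "role": "Senior Engineer"},
]

# In the document model, nested data is one lookup away.
print(json.dumps(resume_doc["experiences"][0]["company"]))
```

Reassembling `resume_rows` and `experience_rows` into the nested shape is exactly the translation work that ORMs automate.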

JSON may appear to reduce the impedance mismatch between application code and the storage layer, but it introduces its own set of problems, such as data encoding formats. These issues are discussed in the next chapter.

Many-to-One and Many-to-Many Relationships

When storing textual information, we usually convert the text into bytes and store it directly. However, this approach has several disadvantages when applied to standardized information (such as country, category, or status). For example, it requires maintaining consistent spelling and formatting across all entries, avoiding ambiguity when values conflict, and it makes the field difficult to update later. To address these issues, data can be normalized by mapping such values to an ID.

Identifying information by ID helps avoid duplication across multiple entries, since the mapping can be translated back to human-readable form when required. As a result, when any information needs to be updated (for example, a city name change), it only needs to be modified in one place. A useful way to think about this distinction is to treat the information itself as human data and the ID as computer data: human data may change frequently, while the computer data can remain stable without breaking references. This also reduces write overhead when updates occur, since the stored ID is usually smaller than the text it replaces.
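A minimal sketch of this normalization, with hypothetical data: city names are stored once in a lookup table, rows hold only the stable ID, and a rename touches exactly one place.

```python
# Human data: id -> display name, stored exactly once.
cities = {1: "Bombay"}

# Computer data: rows reference the stable ID, never the spelling.
people = [
    {"name": "Asha", "city_id": 1},
    {"name": "Ravi", "city_id": 1},
]

def display(person):
    """Resolve the ID back to human-readable form at read time."""
    return f"{person['name']} lives in {cities[person['city_id']]}"

# The city is renamed: one update, and every reference stays valid.
cities[1] = "Mumbai"
assert display(people[0]) == "Asha lives in Mumbai"
```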

While the document model naturally handles one-to-many relationships, normalizing data for many-to-one relationships is more complex due to weaker support for joins compared to relational databases. Such relationships are typically represented using foreign keys (in the relational model) or document references (in the document model), where identifiers are stored and resolved by the database at query time—either through joins or follow-up queries.


To summarize the differences between data models used by relational and document databases:

  • The core benefits of the document model are schema flexibility, better performance due to data locality, and, for some applications, a storage model that more closely matches the application's data model.
  • The relational model provides stronger join support, making it well suited for many-to-one and many-to-many relationships.

Which model allows simpler application code?

If the application’s data is naturally document-like, the document model is usually a better fit. While document databases cannot directly reference deeply nested items within a document, this limitation is rarely an issue as long as nesting is shallow.

However, when applications rely heavily on many-to-one or many-to-many relationships, the document model often avoids joins by denormalizing data. This shifts additional responsibility to application code, which must keep duplicated data consistent. In such cases, joins effectively move into the application layer, requiring multiple queries to emulate relational joins. This increases complexity and is usually slower than joins performed by a relational database.
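Such an application-level join can be sketched as follows. The collections here are hypothetical in-memory stand-ins for a document store without join support: resolving a reference costs a second query, and the sequencing logic lives in application code rather than in the database.

```python
# Two "collections" in a hypothetical document store without joins.
people = {"p1": {"name": "Lucy", "org_id": "o1"}}
orgs = {"o1": {"name": "Acme"}}

def person_with_org(person_id):
    """Emulate a relational join with two sequential queries."""
    person = people[person_id]        # query 1: fetch the document
    org = orgs[person["org_id"]]      # query 2: follow the reference
    return {**person, "org": org["name"]}

assert person_with_org("p1")["org"] == "Acme"
```

A relational database would express this as a single `JOIN` and could choose the access order itself; here the application has fixed it.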

For highly interconnected data, the document model becomes awkward, whereas the relational model is more appropriate, and graph models are often the most natural choice.


Schema flexibility

Most document databases do not enforce a rigid schema on documents. However, this does not mean they are truly schemaless. Instead, the schema is implicitly enforced by client code, which applies structure at read time—often referred to as schema-on-read. In contrast, relational databases use schema-on-write, where the schema is explicit and the database ensures all writes conform to it.

Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write resembles static (compile-time) type checking.

This difference becomes apparent when applications need to change data formats.

  • With schema-on-read, applications can immediately start writing data in the new format and handle validation in code.
  • With schema-on-write, relational databases typically require migrations (such as ALTER TABLE) to add or modify columns, which can be slow and may require downtime (1).

    1. This limitation primarily applies to MySQL with large tables, where migrations may involve copying the entire table.

If application data does not share a uniform structure, schema-on-read is advantageous. For records with consistent structure, explicit schemas help document and enforce correctness.
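A schema-on-read format change can be sketched like this (field names are illustrative): old and new document formats coexist in the database, and client code interprets both at read time, much as a dynamically typed language checks values only when they are used.

```python
# Documents written at different times, in different formats.
docs = [
    {"name": "Lucy Liu"},                        # old format: one field
    {"first_name": "Ada", "last_name": "Byron"}  # new format: split fields
]

def read_first_name(doc):
    """The 'schema' lives here in client code, applied at read time."""
    if "first_name" in doc:
        return doc["first_name"]
    return doc["name"].split(" ")[0]  # fall back to the old format

assert [read_first_name(d) for d in docs] == ["Lucy", "Ada"]
```

Under schema-on-write, the same change would instead be an `ALTER TABLE` plus a backfill, after which the database guarantees every row has the new shape.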


Data locality

In document databases, a document is usually stored as a single continuous sequence of bytes, encoded using formats such as JSON, XML, or BSON. When applications frequently access most of a document at once, this locality provides better performance than data spread across multiple tables.

However, the locality advantage applies only when a large portion of the document is accessed together. On updates, the entire document is typically rewritten, except for cases where the encoded size does not change. For this reason, it is recommended to keep documents small to avoid write amplification.

Data locality is not exclusive to document databases. Some relational systems, such as Google Spanner, support interleaving child table rows with parent table rows to achieve similar locality benefits.


Convergence of document and relational databases

Modern relational databases such as PostgreSQL and MySQL provide native support for JSON data types. Conversely, document databases like RethinkDB support relational-style joins, and MongoDB drivers can automatically resolve references.

This convergence allows the two models to complement each other, leaving it to the application to combine features in a way that best fits its requirements.

Query Languages for Data

Query languages for data can be broadly divided into two paradigms based on their access patterns:

  • Imperative languages, which provide step-by-step instructions to the computer describing how an action should be performed.
  • Declarative languages, which specify what result is required and leave it up to the system to decide how to execute it.

Declarative query languages are generally preferred over imperative ones because:

  • They are more concise and easier to work with, especially through APIs.
  • They hide the internal implementation details of the database engine, allowing database developers to improve or change execution strategies without requiring new versions of the query language.
  • They are easier to execute in parallel, since they describe the desired result rather than the algorithm to produce it. This freedom allows databases to parallelize internal execution when determining results.
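The distinction can be sketched with the same query written both ways, over illustrative in-memory data: the imperative version fixes the loop order and accumulation strategy, while the declarative version states only the desired result, leaving the system free to reorder or parallelize.

```python
animals = [
    {"family": "Sharks", "name": "Great White"},
    {"family": "Rays", "name": "Manta"},
    {"family": "Sharks", "name": "Hammerhead"},
]

# Imperative: spell out HOW, step by step, in a fixed order.
def sharks_imperative():
    result = []
    for animal in animals:
        if animal["family"] == "Sharks":
            result.append(animal["name"])
    return result

# Declarative: state WHAT is wanted; the evaluation strategy is not fixed.
sharks_declarative = [a["name"] for a in animals if a["family"] == "Sharks"]

assert sharks_imperative() == sharks_declarative == ["Great White", "Hammerhead"]
```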
Analogy using CSS and JavaScript

To style selected elements using CSS, you simply define a rule such as li.selected { color: blue }, and the browser applies it to all matching elements. This is a declarative approach.

In contrast, using the JavaScript DOM API requires fetching all li elements, filtering those with the selected class, and then applying the style programmatically. This approach is longer, more procedural, and harder to reason about compared to CSS.

MapReduce Querying

MapReduce is a programming model for processing large amounts of data in bulk by distributing computation across multiple machines. A limited form of MapReduce is also supported by some NoSQL databases, such as MongoDB, as a mechanism for performing read-only queries across multiple documents.

MapReduce combines both imperative and declarative approaches. The query logic itself is expressed using code snippets (imperative), while the processing framework repeatedly invokes this logic across the dataset (declarative). It consists of two functions, map and reduce, which must be pure (1). Because these functions have no side effects, the execution framework can run them anywhere, in any order, and safely re-run them on failure. This property enables distributed query execution and allows MapReduce to serve as a foundation for higher-level query languages, such as distributed SQL engines.

  1. The functions can only operate on the data passed to them as input and cannot perform additional queries.

However, a key drawback of MapReduce is that developers must write two carefully coordinated JavaScript functions, which is more complex and error-prone than using a declarative query language. Additionally, the imperative nature of MapReduce prevents the query optimizer from applying many performance optimizations. As a result, MongoDB introduced aggregation pipelines, which provide a more declarative querying model, closer in spirit to SQL, while still using a JSON-based syntax.
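The map/reduce contract described above can be sketched in plain Python (the data and functions are illustrative, not MongoDB's actual API). Because both functions are pure, a real framework may run them on any machine, in any order, and safely re-run them on failure.

```python
from collections import defaultdict

observations = [
    {"species": "shark", "num": 3},
    {"species": "ray", "num": 1},
    {"species": "shark", "num": 4},
]

def map_fn(doc):
    """Pure function: emit (key, value) pairs for one document."""
    yield (doc["species"], doc["num"])

def reduce_fn(key, values):
    """Pure function: combine all values emitted for one key."""
    return sum(values)

def map_reduce(docs):
    groups = defaultdict(list)
    for doc in docs:                       # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)      # shuffle: group values by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}  # reduce phase

assert map_reduce(observations) == {"shark": 7, "ray": 1}
```

The framework-level loop is the declarative part; only `map_fn` and `reduce_fn` are supplied as imperative snippets.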

Graph-Like Data Models

Many-to-many relationships are an important distinguishing feature across data models. While one-to-many relationships are most naturally handled by the document model using hierarchical structures, the same approach does not work well for many-to-many relationships. These are better handled by the relational model. However, when connections within the data become more complex and highly interconnected, modeling the data as a graph is often a better choice.

A graph is represented using two kinds of objects:

  • Vertices (or nodes), which represent data points such as people in a social graph or junctions in a rail network.
  • Edges (or relationships), which represent the connections between two nodes, such as friendships in a social graph or railway lines between junctions in a rail network.

Using these structures, well-known graph algorithms can be applied efficiently. For example, shortest-path algorithms can be used to find routes between two points in a road network, while PageRank can be used to determine the popularity of web pages and rank them in search results.
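As a small illustration of the shortest-path case, here is a breadth-first search over a hypothetical unweighted road network (vertex names and edges are invented for the example):

```python
from collections import deque

# Hypothetical road network: junction -> reachable neighbor junctions.
roads = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}

def shortest_path(start, goal):
    """Breadth-first search: the first path to reach the goal is shortest."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in roads[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable

assert shortest_path("A", "E") == ["A", "B", "D", "E"]
```

Graph databases run queries like this over vertices and edges directly, without the application having to hand-code the traversal.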

Graphs are also well suited for representing heterogeneous data. For instance, a social network can be modeled as a single graph containing different types of vertices (people, locations, comments) and edges (friendships, check-ins, comments-on). Based on this flexibility, there are several related but distinct ways of structuring and querying data using graphs, which we discuss next.

Property Graphs

In the property graph model, each:

  • Vertex has
      • a unique ID
      • a set of incoming and outgoing edges
      • a collection of key–value pairs that store its properties
  • Edge has
      • a unique ID
      • references to the vertices at which it starts and ends (the tail and head vertices)
      • a label describing the relationship between the connected vertices
      • key–value pairs for storing additional properties

Vertices and edges are typically stored on disk in a form analogous to two relational tables, one for vertices and one for edges, with the fields described above represented as columns. Some key characteristics of this model include:

  • No rigid schema restrictions: any type of vertex can be connected to any other vertex.
  • Efficient traversal: given a vertex, both its incoming and outgoing edges can be found efficiently, enabling fast graph traversal.
  • Labeled relationships: labels allow heterogeneous data to coexist within a single graph while keeping the data model expressive and clean.
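The two-table layout and bidirectional traversal can be sketched as follows (the data and helper functions are illustrative; a real store would index edges by head and tail rather than scanning):

```python
# Vertex "table": id -> label and key-value properties.
vertices = {
    1: {"label": "Location", "props": {"name": "Idaho"}},
    2: {"label": "Location", "props": {"name": "United States"}},
}

# Edge "table": each row records tail (start), head (end), label, properties.
edges = [
    {"id": 100, "tail": 1, "head": 2, "label": "WITHIN", "props": {}},
]

def outgoing(vertex_id):
    """Edges leaving a vertex (index on tail in a real system)."""
    return [e for e in edges if e["tail"] == vertex_id]

def incoming(vertex_id):
    """Edges arriving at a vertex (index on head in a real system)."""
    return [e for e in edges if e["head"] == vertex_id]

assert outgoing(1)[0]["label"] == "WITHIN"
assert incoming(2)[0]["tail"] == 1
```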

For example, you can model regional structures within countries without committing to fine-grained hierarchies upfront, and later extend the graph to include people along with their places of residence and origin. Additional data, such as food allergies, can be introduced by linking people to allergens and allergens to foods, making it easy to query which foods are safe for a given person. This illustrates the strong evolvability of the property graph model, as the data model can be extended naturally to accommodate new requirements.

The Cypher Query Language

Cypher is a declarative query language designed for querying property graphs. It was originally developed for the Neo4j DBMS and is optimized for expressing graph pattern matching in a concise and readable form.


Data Insertion

Cypher allows vertices (nodes) and edges (relationships) to be created using pattern-based expressions. The following example inserts a small location hierarchy and a person connected to it:

Cypher Query
CREATE
    (NAmerica:Location {name: 'North America', type: 'continent'}),
    (USA:Location      {name: 'United States', type: 'country'}),
    (Idaho:Location    {name: 'Idaho', type: 'state'}),
    (Lucy:Person       {name: 'Lucy'}),
    (Idaho)-[:WITHIN]->(USA)-[:WITHIN]->(NAmerica),
    (Lucy)-[:BORN_IN]->(Idaho)

Each vertex is assigned a symbolic identifier that can be reused within the same query to define relationships. Labels (Location, Person) classify vertices, while relationship types (WITHIN, BORN_IN) describe semantic connections.


Querying Graph Patterns

Cypher queries describe graph patterns to be matched, rather than specifying execution steps. For example, the following query finds all people who were born in the United States and currently live in Europe:

Cypher Query
MATCH
    (person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
    (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
RETURN person.name

This query expresses the required relationships declaratively:

  • A BORN_IN relationship to any location within the United States
  • A LIVES_IN relationship to any location within Europe

The WITHIN*0.. syntax represents a variable-length traversal, allowing hierarchical location structures of arbitrary depth.

Equivalent queries can also be expressed by:

  • Filtering people based on birthplace and residence properties, or
  • Filtering locations first and then traversing incoming BORN_IN or LIVES_IN relationships

Because Cypher is declarative, queries do not encode execution strategies. The database query planner determines the most efficient execution plan, including traversal order and index usage, without requiring changes to the query itself.

Graph Queries in SQL

Graphs can also be represented using relational schemas and queried using SQL. However, this approach requires explicit JOIN logic and careful ordering of operations:

flowchart TD
    A[Start] --> B[Find vertex with name = United States]
    B --> C[Initialize set: in_usa = United States]
    C --> D[Traverse incoming :within edges]
    D --> E[Add discovered vertices to in_usa]
    E --> D

    A --> F[Find vertex with name = Europe]
    F --> G[Initialize set: in_europe = Europe]
    G --> H[Traverse incoming :within edges]
    H --> I[Add discovered vertices to in_europe]
    I --> H

    E --> J[From in_usa vertices<br/>follow incoming :born_in edges]
    J --> K[Set: people_born_in_usa]

    I --> L[From in_europe vertices<br/>follow incoming :lives_in edges]
    L --> M[Set: people_living_in_europe]

    K --> N[Intersect / Join]
    M --> N

    N --> O[People born in USA<br/>and living in Europe]

Compared to Cypher, SQL-based graph queries are more verbose and expose execution details, illustrating how different data models favor different access patterns and use cases.
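A sketch of the SQL side, using SQLite's recursive common table expression support (the schema and data are illustrative). The `WITH RECURSIVE` clause plays the role of Cypher's `WITHIN*0..`: it starts from the United States vertex and repeatedly follows incoming within edges.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE within  (child INTEGER, parent INTEGER);
    CREATE TABLE born_in (person INTEGER, location INTEGER);
    INSERT INTO vertices VALUES (1, 'United States'), (2, 'Idaho'), (3, 'Lucy');
    INSERT INTO within  VALUES (2, 1);   -- Idaho is within the United States
    INSERT INTO born_in VALUES (3, 2);   -- Lucy was born in Idaho
""")

rows = db.execute("""
    WITH RECURSIVE in_usa(id) AS (
        SELECT id FROM vertices WHERE name = 'United States'
        UNION
        SELECT w.child FROM within w JOIN in_usa ON w.parent = in_usa.id
    )
    SELECT v.name
    FROM born_in b
    JOIN in_usa   ON b.location = in_usa.id
    JOIN vertices v ON v.id = b.person
""").fetchall()

assert rows == [("Lucy",)]
```

Even with the CTE doing the recursion, the query spells out the join order and intermediate sets explicitly, which is what the flowchart above makes visible.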

Triple-Stores and SPARQL

A triple store is conceptually similar to the property graph model. While it uses different terminology, the underlying ideas are closely related. It is worth discussing separately because of the rich ecosystem of tools and query languages built around triple stores.

In a triple store, information is represented as three-part statements, called triples, of the form (subject, predicate, object).

  • The subject is equivalent to a vertex (node).
  • The predicate represents the relationship or property name, similar to a labeled edge.
  • The object can be either a primitive value (such as a string or number) or another vertex in the graph.
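The triple model is simple enough to sketch directly: facts are (subject, predicate, object) tuples, and queries are patterns in which some positions are left as wildcards (here `None` stands in for a query variable; the data is illustrative).

```python
# A tiny triple store: every fact is a (subject, predicate, object) tuple.
triples = [
    ("lucy",  "a",      "Person"),
    ("lucy",  "name",   "Lucy"),
    ("lucy",  "bornIn", "idaho"),
    ("idaho", "name",   "Idaho"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Where was lucy born?" -> leave the object position unbound.
assert match(s="lucy", p="bornIn") == [("lucy", "bornIn", "idaho")]
```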

RDF Representation Example

The same data modeled earlier using a property graph can be expressed in RDF using Turtle notation, a human-readable serialization format:

Using Turtle notation
@prefix : <urn:example:>.

_:lucy      a       :Person.
_:lucy      :name   "Lucy".
_:lucy      :bornIn _:idaho.

_:idaho     a       :Location.
_:idaho     :name   "Idaho".
_:idaho     :type   "state".
_:idaho     :within _:usa.

_:usa       a       :Location.
_:usa       :name   "United States".
_:usa       :type   "country".
_:usa       :within _:namerica.

_:namerica  a       :Location.
_:namerica  :name   "North America".
_:namerica  :type   "continent".

Each line represents a single triple, and together they form a graph of interconnected entities.


Semantic Web Background

The Semantic Web was based on a simple idea: since websites already publish information for humans to read, they could also publish structured, machine-readable data that computers can interpret. The Resource Description Framework (RDF) was designed to allow data from different websites to be published in a consistent format, enabling automatic combination into a global "web of data": effectively, an internet-scale database.

While the broader Semantic Web vision did not fully materialize, it resulted in an important and enduring contribution: the RDF data model.

Because RDF was designed for data sharing across independent systems, subjects, predicates, and objects are typically URIs. These URIs serve as globally unique identifiers, allowing data from different sources to be combined without naming conflicts. A URI does not need to resolve to a real web resource; it primarily acts as a namespace (for example, urn:example:within).


SPARQL Query Language

SPARQL is the query language used for querying RDF data stored in triple stores.

Example Query
PREFIX : <urn:example:>
SELECT ?personName WHERE {
    ?person :name ?personName.
    ?person :bornIn / :within* / :name "United States".
    ?person :livesIn / :within* / :name "Europe".
}

This query finds people who were born in the United States and currently live in Europe by matching graph patterns over RDF triples. The / operator expresses path traversal, and * allows variable-length paths, similar to recursive traversal in graph query languages.


Properties vs. Relationships

Unlike property graphs, RDF does not distinguish between properties and relationships. Both are represented uniformly as predicates in triples. As a result, the same query syntax can be used to match either.

For example, the pattern ?usa :name "United States" binds the variable ?usa to any vertex that has a name predicate with the value United States. This uniformity simplifies the data model, but can also make certain distinctions less explicit compared to property graphs.

The Foundation: Datalog

Datalog is another query language that predates both SPARQL and Cypher. It is used in a small number of data systems, such as Datomic and Cascalog, and has historically been applied to querying large datasets on Apache Hadoop.

Datalog uses a data model similar to triple stores, but with a more generalized form. Instead of representing data as (subject, predicate, object), it represents facts as predicates of the form: predicate(subject, object)


Example: Datalog Facts and Rules

For example,
/* Data */
name(namerica, 'North America').
type(namerica, continent).
name(usa, 'United States').
type(usa, country).
within(usa, namerica).

name(idaho, 'Idaho').
type(idaho, state).
within(idaho, usa).

name(lucy, 'Lucy').
born_in(lucy, idaho).

/* Querying */
within_recursive(Location, Name) :-
    name(Location, Name).                      /* Rule 1 */

within_recursive(Location, Name) :-
    within(Location, Via),
    within_recursive(Via, Name).               /* Rule 2 */

migrated(Name, BornIn, LivingIn) :-
    name(Person, Name),                        /* Rule 3 */
    born_in(Person, BornLoc),
    within_recursive(BornLoc, BornIn),
    lives_in(Person, LivingLoc),
    within_recursive(LivingLoc, LivingIn).

?- migrated(Who, 'United States', 'Europe').    

Rules and Evaluation

To query data in Datalog, we first define rules, which describe how new predicates can be derived from existing data or from other rules. In the example above, within_recursive and migrated are derived predicates. Rules can refer to other rules, enabling recursive definitions similar to functions calling other functions.

A rule applies only if the system can find matches for all predicates on the right-hand side of the :- operator. For example, the predicate name(Location, Name) matches the fact name(namerica, 'North America'). When a rule applies, it behaves as if the predicate on the left-hand side were added to the database.
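This derivation process can be sketched as a naive bottom-up evaluation: start from the base facts, keep applying the rules, and stop when no new facts appear (a fixpoint). The sketch below implements only the within_recursive rules from the example, with the facts inlined as Python data.

```python
# Base facts, mirroring the Datalog data above.
name = {"namerica": "North America", "usa": "United States", "idaho": "Idaho"}
within = [("usa", "namerica"), ("idaho", "usa")]

# Rule 1: within_recursive(Location, Name) :- name(Location, Name).
within_recursive = {(loc, n) for loc, n in name.items()}

# Rule 2: within_recursive(Location, Name) :-
#             within(Location, Via), within_recursive(Via, Name).
# Apply repeatedly until a fixpoint: no rule derives anything new.
changed = True
while changed:
    changed = False
    for loc, via in within:
        for (loc2, n) in list(within_recursive):
            if loc2 == via and (loc, n) not in within_recursive:
                within_recursive.add((loc, n))
                changed = True

# Idaho is transitively within the United States and North America.
assert ("idaho", "United States") in within_recursive
assert ("idaho", "North America") in within_recursive
```

Real Datalog engines are far more efficient (for example, semi-naive evaluation avoids re-deriving old facts), but the fixpoint idea is the same.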

One possible evaluation sequence is:

flowchart TD
%% Base facts
    A[name_namerica_NorthAmerica]
    B[within_usa_namerica]
    C[within_idaho_usa]
    D[born_in_lucy_idaho]
    E[lives_in_lucy_europe]

%% Rule 1
    A -->|rule1| A1[within_recursive_namerica_NorthAmerica]

%% Rule 2 applications
    A1 -->|rule2| B1[within_recursive_usa_NorthAmerica]
    B --> B1

    B1 -->|rule2| C1[within_recursive_idaho_NorthAmerica]
    C --> C1

%% Rule 3
    C1 -->|rule3| F[migrated_lucy_USA_Europe]
    D --> F
    E --> F

%% Query result
    F --> G[result_Who_lucy]

The Datalog approach requires a different way of thinking compared to the other query languages discussed in this chapter. Instead of issuing one-off queries, users define reusable rules that describe relationships in the data.

This makes Datalog less convenient for simple, ad hoc queries, but very powerful for complex domains, where rules can be composed, reused, and evaluated efficiently over large datasets.