Database Internals¶
At a high level, a DB system stores vast amounts of data and provides an API for querying that data efficiently. To do this, DBs use various strategies and data structures, most of which revolve around two core structures: the Table and the Index.
Table¶
A table is a logical structure which defines how data is modeled and stored on disk. From the outside, a table is a collection of rows(1) and columns(2), usually representing an entity in your application. Internally, a table is a collection of tuples (rows) organized across pages in one or more data files.
- a unique record or instance of the entity represented by the table
- specifies an attribute or field of the referenced row
Tuple¶
A tuple is a structured block of bytes stored inside a page: the physical realization of a row. It is composed of a header and a data section, roughly in the following format:
| tuple header (visibility, length, flags, etc.) | col_1 value | col_2 value...|
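As a rough illustration, a tuple like this can be packed with Python's `struct` module. The header fields here (`xmin`, total length, flags) and their widths are assumptions for the sketch, not any real engine's on-disk format:

```python
import struct

# Hypothetical 8-byte tuple header: 4-byte xmin (visibility), 2-byte total
# length, 2-byte flags -- illustrative only, not a real engine's layout.
def pack_tuple(xmin, flags, *cols):
    payload = b"".join(cols)
    header = struct.pack("<IHH", xmin, 8 + len(payload), flags)
    return header + payload

# col_1 = 32-bit int 42, col_2 = text "alice"
t = pack_tuple(1001, 0, struct.pack("<i", 42), b"alice")
xmin, length, flags = struct.unpack_from("<IHH", t)
assert (xmin, length, flags) == (1001, 17, 0)
```

Fixed-width header first, variable-length column values after: that ordering is what lets the engine read the length and visibility info before parsing the columns.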
Within the page, each tuple is paired with a line pointer which records the byte offset of the tuple relative to the start of the page, similar to the following:
[Page]
...
├── Line pointer → Tuple #1 (offset 40)
├── Line pointer → Tuple #2 (offset 120)
|
...
|
[ ... Tuple Data Area ... ]
With this, a DB can physically identify each tuple within a table using the combination known as the
Tuple ID -> (PageNumber, LinePointer).
For example, TID -> (PageNumber=42, LinePointer=2) would read page \(42\) into memory, look up
line pointer \(2\), and fetch the data that line pointer points to.
TID allows DBs to reference rows efficiently and uniquely across various data structures like indexes.
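A minimal sketch of this lookup, with hypothetical `Page` and `LinePointer` structures (not any real engine's API):

```python
from dataclasses import dataclass

PAGE_SIZE = 8192  # 8 KB, a common DB page size

@dataclass
class LinePointer:
    offset: int   # byte offset of the tuple within the page
    length: int   # tuple length in bytes

@dataclass
class Page:
    number: int
    line_pointers: list   # the slot directory
    data: bytearray       # raw page bytes

def fetch_tuple(pages: dict, tid: tuple) -> bytes:
    page_number, slot = tid
    page = pages[page_number]          # read the page into memory
    lp = page.line_pointers[slot - 1]  # line pointers are 1-based here
    return bytes(page.data[lp.offset : lp.offset + lp.length])

# Usage: page 42 holding a single tuple at offset 40.
data = bytearray(PAGE_SIZE)
data[40:45] = b"hello"
pages = {42: Page(42, [LinePointer(40, 5)], data)}
assert fetch_tuple(pages, (42, 1)) == b"hello"
```

The key property is that the TID alone is enough to locate the bytes: no search or comparison is involved, just two array lookups.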
When designing a DB system, there are two broad choices based on how this mapping from TID to physical location is used:
- DBs (like Postgres) where the table is physically stored as a heap (1) use (PageNumber, LinePointer) directly, since the rows are unordered.
    - Unordered pages of tuples.
- DBs (like MySQL InnoDB) which use clustered indexes (1) lay out the physical location of tuples based on the primary key. Because of this, the TID mapping above can't be used directly: you have to use the primary key to reach the physical location, which internally relies on a TID-like mapping managed by the storage layer.
    - Tuples are stored inside the index, whose key determines both the physical and logical location of the tuple.
Tradeoff
You get the following summarized tradeoff based on this design choice:
| Aspect | PostgreSQL (Internal TID) | MySQL/InnoDB (External RowID) |
|---|---|---|
| Row addressing | Physical (page, slot) | Logical (primary key) |
| Update behavior (MVCC) | New version = new TID (row moves) | Row stays; old versions kept in undo log |
| Index maintenance | Indexes point to TIDs → need update when TIDs change | Indexes point to PKs → stable, fewer updates |
| Lookup cost (secondary index) | One hop (index → heap) | Two hops (secondary → primary → data) |
| Insert performance | Fast (append to heap) | Slower (must maintain clustered order) |
| Range scan performance | Slower (heap unordered) | Faster (rows ordered by PK) |
| VACUUM / cleanup | Required to reclaim old tuples | Handled via purge of undo logs |
| Storage flexibility | Simple, flexible heap | More rigid due to clustering |
Page¶
All data structures in DBs (like tables, collections, rows, columns, indexes, sequences, and documents) end up as bytes in a page. This model decouples the storage engine from the DB frontend, which is responsible for formatting the data and providing an API over it. A DB page is the fundamental unit of I/O and storage inside a DB engine: the smallest chunk of data the DB reads from or writes to disk, or caches in memory.
But why do DBs use fixed-size blocks for their read/write operations? This comes down to the way disk storage (HDDs/SSDs) works and how it differs from memory (RAM).
Note
A physical disk operates at the level of sectors (HDD) or blocks (SSD), which persist fixed chunks of data even in the absence of power. Working with the disk is abstracted by the OS via the LBA API and then the file system API, which defines fixed-size chunks known as pages (usually \(4\) KB) as the smallest unit to read from and write to disk (one I/O).
So working with the disk requires fixed-size blocks of storage known as pages. Not to be confused with OS pages: DBs use their own logical page definition. This approach offers several key advantages that DB systems require:
- Separation of concerns: the OS is designed to manage generic files and memory, while DBs need exact control over how data is laid out, cached, logged, and recovered. Using their own page abstraction allows DB systems to apply all such optimizations.
- The OS page size defaults to \(4\) KB, which works for many applications but is inefficient for DBs, since the space taken by metadata would outweigh the useful information stored. Today, DBs use page sizes ranging from \(8\) to \(16\) KB, optimized for their workloads.
Small vs Large DB Pages
- Small pages are faster to read and write, especially if the page size is close to the media block size. However, the overhead of the page header metadata relative to useful data gets higher for smaller pages.
- Larger pages minimize metadata overhead and page splits, but at the cost of more expensive cold reads/writes.
- This allows DBs to implement their own page cache/buffer pool, since they already know which pages are "hot"(1) and want to manage them efficiently.
    - When the required page is already present in memory/cache.
- You can consistently port your data to different OS platforms, which makes replication and backup operations simpler.
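As an illustration of the buffer-pool idea, here is a toy LRU page cache; real engines use more sophisticated policies (Postgres, for example, uses a clock-sweep variant), and `read_page_from_disk` is a stand-in for the storage layer, not a real API:

```python
from collections import OrderedDict

class BufferPool:
    """Toy LRU page cache -- a sketch, not a production design."""
    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read = read_page_from_disk
        self.cache = OrderedDict()   # page_number -> page contents
        self.hits = self.misses = 0

    def get_page(self, number):
        if number in self.cache:
            self.hits += 1
            self.cache.move_to_end(number)      # mark as most recently used
            return self.cache[number]
        self.misses += 1
        page = self.read(number)                # cold read: one disk I/O
        self.cache[number] = page
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return page

pool = BufferPool(2, lambda n: f"page-{n}")
pool.get_page(1); pool.get_page(2); pool.get_page(1)  # third call is a cache hit
pool.get_page(3)                                      # evicts page 2 (least recent)
assert pool.hits == 1 and 2 not in pool.cache
```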
A simplified page structure on disk looks as follows:
+----------------------------------------------------+
| Page Header (metadata) |
|----------------------------------------------------|
| Line Pointer Array (Item IDs / Slot Directory) |
|----------------------------------------------------|
| Free Space |
|----------------------------------------------------|
| Tuple Data Area (actual rows/tuples) |
|----------------------------------------------------|
| Special Space (optional, for indexes) |
+----------------------------------------------------+
- The Page Header is a fixed-size block of bytes which stores metadata like the page LSN (Log Sequence Number) -> for WAL consistency, checksums/CRC -> corruption detection, and various flags and pointers.
- The line pointer array stores the offset of each tuple within the page, along with other metadata like the tuple length or whether the tuple is dead or redirected. These pointers grow downward from the header, so deleted/updated tuples can be marked accordingly without moving other tuples immediately.
- The Tuple Data Area is where the actual data is stored. It consists of tuple headers (visibility info, transaction IDs) and column data. Tuple data grows upward from the bottom of the page.
Note
Space occupied by dead/redirected tuples is reclaimed later by cleanup operations like VACUUM or compaction.
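The layout above can be sketched with Python's `struct` module. The header fields and sizes here are illustrative assumptions (an 8-byte LSN, a 4-byte checksum, and "lower"/"upper" free-space bounds), not any real engine's on-disk format:

```python
import struct

PAGE_SIZE = 8192
HEADER_FMT = "<QIHH"   # assumed: LSN, checksum, lower/upper free-space bounds
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 16 bytes

def new_page():
    page = bytearray(PAGE_SIZE)
    # "lower" starts just past the header, "upper" at the end of the page;
    # the gap between them is the free space.
    struct.pack_into(HEADER_FMT, page, 0, 0, 0, HEADER_SIZE, PAGE_SIZE)
    return page

def insert_tuple(page, tup):
    _, _, lower, upper = struct.unpack_from(HEADER_FMT, page, 0)
    upper -= len(tup)                      # tuple data grows upward from the bottom
    page[upper : upper + len(tup)] = tup
    struct.pack_into("<HH", page, lower, upper, len(tup))  # line pointer grows downward
    lower += 4
    struct.pack_into(HEADER_FMT, page, 0, 0, 0, lower, upper)
    return (lower - HEADER_SIZE) // 4      # 1-based slot number

def read_tuple(page, slot):
    off = HEADER_SIZE + (slot - 1) * 4
    offset, length = struct.unpack_from("<HH", page, off)
    return bytes(page[offset : offset + length])

page = new_page()
slot = insert_tuple(page, b"row-1")
assert slot == 1 and read_tuple(page, 1) == b"row-1"
```

Because the slot directory and tuple area grow toward each other, the page is full exactly when "lower" meets "upper", with no wasted space reserved for either side.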
Role of Page Layout in forming different DB domains
The internal page layout and tuple organization fundamentally define what kind of database it is:
- Row-based DBs store complete tuples (rows) in each page. This gives you access to all columns of a row in the same page, which is ideal for OLTP workloads involving frequent reads/writes over entire rows.
- Columnar DBs organize each page to store data for an individual column across many rows. Since column values are stored contiguously, such DBs are ideal for OLAP workloads, which involve reading a few columns across a large number of rows.
- Document DBs organize their data into self-contained objects (JSON/BSON) of variable length known as documents. This model provides a flexible schema and fast document-level reads/writes.
- Graph DBs model their data as nodes and relationships (edges). Each page stores node, relationship, or property records, where relationship records hold direct pointers to their start and end nodes, making traversal fast.
- Key-value stores (sorted key pages) like RocksDB and LevelDB store sorted key-value pairs. The storage engine (often an LSM tree) manages multiple sorted runs, which makes it efficient for range scans and sequential writes.
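The row vs. columnar distinction above can be illustrated with plain Python lists as stand-ins for pages (a sketch that ignores on-disk encoding entirely):

```python
# The same three records laid out row-wise vs. column-wise.
rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]

# Row-based page: complete tuples stored together -> cheap full-row access (OLTP).
row_page = list(rows)

# Columnar pages: one "page" per column -> scanning a single
# column touches far less data (OLAP).
col_pages = {
    "id":   [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age":  [r[2] for r in rows],
}

# "Average age" only needs the age page, never touching names or ids.
avg_age = sum(col_pages["age"]) / len(col_pages["age"])
assert avg_age == 32.0
```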
Index¶
Another data structure which plays a crucial role in the working of DBs is the Index. Logically, an index is a data structure which allows the database to quickly locate rows without scanning the entire table. Physically, indexes are just files made up of fixed-size pages consisting of index entries. The structure of these entries (and how the pages are connected) defines the index type. The most common type is the B-Tree (B+Tree), which almost all general-purpose databases (Postgres, MySQL, Oracle, SQL Server, SQLite) use as their standard index.
Note
Don't let the B-Tree vs. B+Tree naming confuse you: modern DBs use B+Tree implementations throughout, but you'll still find people using the B-Tree name here and there.
At a high level, a B+Tree index is a hierarchy of pages consisting of Root, Internal, and Leaf pages (or nodes). The Root node is the entry point for lookups and points to Internal nodes, which hold indexed key -> child node mappings. At the bottom, the Leaf nodes hold the index key -> TupleID mappings. Each node stores a sorted list of keys and pointers, which lets you binary search it to find the respective key in \(O(\log n)\) time. The leaf nodes are also doubly linked to adjacent nodes, which enables faster range queries.
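The descent described above can be sketched as follows; the `Node` shape and names are illustrative, not any engine's actual structures:

```python
from bisect import bisect_right

# Minimal B+Tree lookup sketch: internal nodes map keys -> children,
# leaf nodes map keys -> TIDs and link to the right sibling for range scans.
class Node:
    def __init__(self, keys, children=None, tids=None, next_leaf=None):
        self.keys = keys          # sorted keys enable binary search
        self.children = children  # internal node: len(children) == len(keys) + 1
        self.tids = tids          # leaf node: one TID per key
        self.next_leaf = next_leaf

def lookup(root, key):
    node = root
    while node.children is not None:          # descend internal levels
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_right(node.keys, key) - 1      # binary search within the leaf
    if i >= 0 and node.keys[i] == key:
        return node.tids[i]
    return None

# A two-level tree: root over two sibling leaves.
leaf1 = Node(keys=[10, 20], tids=[(1, 1), (1, 2)])
leaf2 = Node(keys=[30, 40], tids=[(2, 1), (2, 2)])
leaf1.next_leaf = leaf2                       # sibling link for range scans
root = Node(keys=[30], children=[leaf1, leaf2])
assert lookup(root, 20) == (1, 2) and lookup(root, 25) is None
```

A range scan would find the starting leaf the same way, then follow `next_leaf` links instead of re-descending from the root.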
Some DBs (like InnoDB) store the data tuples within the index itself instead of using TIDs. Such indexes are known as Clustered Indexes, as they cluster the table within the index. Since the data is stored directly in the index, the I/O cost for a lookup is usually a single page read, which is faster than the heap counterpart. You also get faster range queries, since rows with sequential keys live within the same page. But this design severely impacts inserts, since you must keep rows physically ordered by key, which can cause page splits and random I/O if the indexing key is random. Also, secondary indexes point to the primary key in a clustered index: if your primary key is modified, all your secondary indexes need to be updated. You also need to be aware of the primary key size, which impacts the size and performance of every secondary index.
To optimize index performance, keep the index entries as small as possible, since index pages must be loaded into memory before they can be used. Smaller index entries let you pack more entries per index page, and if those entries contain user-managed data, be mindful of the size of the user-defined fields stored in them.
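A back-of-the-envelope calculation shows why entry size matters. The numbers here (8 KB pages, roughly 200 bytes of per-page overhead) are assumptions for illustration, not measurements:

```python
import math

# Rough estimate of B+Tree height: entries per page gives the fan-out,
# and the height is the number of levels needed to address n_rows.
def tree_height(page_size, entry_size, n_rows, overhead=200):
    fanout = (page_size - overhead) // entry_size
    return math.ceil(math.log(n_rows, fanout))

# 16-byte entries (small int key + TID) vs. 128-byte entries (wide text key),
# over 100 million rows:
assert tree_height(8192, 16, 100_000_000) == 3   # ~499 entries/page
assert tree_height(8192, 128, 100_000_000) == 5  # ~62 entries/page
```

Two extra levels means two extra page reads on every cold lookup, which is why wide primary keys are particularly costly in clustered designs where every secondary index repeats them.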