
Storage and Retrieval

The previous chapter discussed how the application layer models data for a database. In this chapter, we examine how the database, or more specifically, the storage engine, models data for the layers beneath it. This topic is important because different storage engines are designed for different types of applications, particularly transactional and analytical operations.

Transactional databases typically use one of two popular categories of storage engines that are widely adopted across many DBMSs: log-structured and page-oriented storage engines.

Data Structures That Power Databases

Log-structured storage engines record data in an append-only, sequential data file known as a log. For example, the simplest log-structured database can store key–value pairs separated by a delimiter in a file. To operate on this data, the database needs two basic methods:

  • set to append a key–value pair to the file
  • get to return the most recent value associated with a key

This approach optimizes set operations for fast, high-throughput writes, as it uses sequential writes which are much faster than random disk writes. However, this choice negatively impacts the performance of get operations, since the DB may need to scan the entire log to locate a key.
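As a sketch, these two operations can be implemented in a few lines of Python (the file name and the comma delimiter are illustrative choices, not part of any real engine):

```python
import os

class LogKV:
    """Toy append-only key-value store: set appends a record,
    get scans the whole file and returns the last match."""

    def __init__(self, path):
        self.path = path

    def set(self, key, value):
        # Sequential append: fast no matter how much data already exists.
        with open(self.path, "a") as f:
            f.write(f"{key},{value}\n")

    def get(self, key):
        # Full scan: the most recently written value for the key wins.
        if not os.path.exists(self.path):
            return None
        result = None
        with open(self.path) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
        return result

db = LogKV("data.log")
db.set("user1", "alice")
db.set("user1", "bob")
print(db.get("user1"))  # most recent value: bob
```

Note that `get` is O(number of records), which is exactly the cost an index is meant to remove.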

To improve the performance of get operations, databases use an auxiliary data structure called an index.

Indexes are additional data structures derived from the data stored in the database. They are organized to enable efficient lookup of indexed keys in the file. Indexes significantly improve read performance, but they introduce overhead on writes because each write must also update the index. Due to this tradeoff, databases don't automatically index all data; instead, they allow developers to choose the indexes that best fit their application's access patterns.

Hash Index

The simplest approach is to use an in-memory hash map that maps each key to the byte offset of its corresponding value in the data file. With this approach:

  • get can directly retrieve the byte offset for a key and seek to the corresponding location on disk to fetch the value
  • set must maintain correct byte offsets when writing new values or updating existing ones

This simple design is also used by Bitcask, a storage engine designed for systems that require high-performance reads and writes, provided that all keys fit in the available memory. This model works particularly well for workloads in which the value for each key is updated frequently, so there are many writes but not too many distinct keys.
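A toy version of this hash-indexed design might look as follows (the class name and file layout are made up for illustration):

```python
import os

class HashIndexKV:
    """Bitcask-style store: an append-only data file plus an in-memory
    hash map from each key to the byte offset of its latest value."""

    def __init__(self, path):
        self.path = path
        self.index = {}              # key -> byte offset of latest record
        open(self.path, "a").close() # ensure the data file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()        # byte offset where this record begins
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset     # remember where the newest value lives

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)           # jump straight to the record: no scan
            line = f.readline().decode().rstrip("\n")
        return line.partition(",")[2]

db = HashIndexKV("bitcask.log")
db.set("user1", "alice")
db.set("user1", "bob")
print(db.get("user1"))  # bob: the index points at the most recent record
```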

However, continuously appending data to a file can waste storage space by retaining duplicate and obsolete values. To address this issue, databases break the log into segments of a fixed size. Once a segment reaches its size limit, a new segment file is created to record subsequent writes.

Over time, older segment files are compacted and merged by retaining only the most recent value for each key and discarding obsolete entries. To avoid blocking ongoing reads and writes, compaction is performed in the background while keeping the old segment files available. Once compaction is complete, the database deletes the old segments and switches to the newly merged segment file for processing incoming requests.
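The compaction-and-merge step can be sketched as follows, modeling each segment as a list of (key, value) records in write order:

```python
def compact_and_merge(segments):
    """Merge segment files (ordered oldest to newest) into one new segment,
    keeping only the most recent value for each key."""
    latest = {}
    for segment in segments:          # later segments overwrite earlier ones
        for key, value in segment:
            latest[key] = value
    return list(latest.items())

old = [("user1", "alice"), ("user2", "carol"), ("user1", "bob")]
new = [("user2", "dave")]
print(compact_and_merge([old, new]))  # user1 -> bob, user2 -> dave
```

In a real engine this runs in the background and the output is written to a fresh segment file before the old ones are deleted.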

Since values are distributed across multiple segment files, each segment maintains its own in-memory hash index. To look up a key, the database starts by checking the hash indexes from the most recent segment to the oldest one. To reduce lookup overhead, the number of segment files is kept small. A few additional implementation details are important to make this approach work effectively:

  • It is faster and simpler to store values in a binary format rather than a text-based format.
  • Deletions are handled by appending a special marker called a tombstone. A tombstone indicates that the corresponding value has been deleted and can be safely removed during compaction.
  • To enable fast recovery after crashes, databases periodically snapshot the hash indexes to disk.
  • Data corruption is detected using checksums, which is especially useful for identifying partially written records.
  • Since writes are append-only, a common design choice is to use a single writer thread, while reads can be processed concurrently by multiple threads.

Advantages of appending entries instead of overwriting:

  • Appending and segment merging are sequential operations, which are significantly faster than random disk writes.
  • Concurrency control and crash recovery are simpler in an append-only design.
  • Periodic merging of old segments prevents long-term data fragmentation.

Limitations of this approach:

  • The hash index must fit in memory, which becomes impractical for a very large number of keys. An on-disk hash map is possible, but it performs poorly due to random access patterns.
  • Range queries are inefficient because there is no key locality: each key in the range must be looked up and fetched individually.

SSTables and LSM-Trees

Earlier, segment files stored values in the order they were written. With LSM-Trees, however, data is stored sorted by key, and each key appears at most once in a segment file. This storage format is called a Sorted String Table (SSTable), and it provides several advantages:

  • Merging segment files becomes simple and efficient, even when the files are larger than memory. A merge operation can use a merge sort style algorithm: read the smallest key from each segment file, write only the most recent version of each key to the output file, and repeat until all segments are merged. If a key appears in multiple segments, only the value from the most recent segment is retained.
  • A sparse in-memory index can be used because segment files are sorted by key. The index stores a subset of keys, each covering a range of a few kilobytes in the segment file. To fetch a value whose key is not directly present in the index, the database seeks to the offset of the nearest preceding indexed key and scans forward from there to locate the required key–value pair.
  • Since each read typically scans multiple key–value pairs, databases can compress these ranges before writing them to disk, reducing storage usage and I/O bandwidth. The index then points to compressed blocks, which are decompressed to retrieve the contained key–value pairs.
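The merge step described in the first point can be sketched with Python's heapq.merge, assuming each segment is already sorted by key:

```python
import heapq

def merge_sstables(segments):
    """Merge SSTables (each sorted by key, listed oldest to newest) into
    one sorted segment; for duplicate keys, the newest segment wins."""
    # Tag records so that, for equal keys, the newer segment sorts first.
    tagged = [
        [(key, -i, value) for key, value in seg]
        for i, seg in enumerate(segments)
    ]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:      # first record per key is the newest version
            merged.append((key, value))
            last_key = key
    return merged

old = [("apple", 1), ("cherry", 2)]
new = [("banana", 9), ("cherry", 7)]
print(merge_sstables([old, new]))  # [('apple', 1), ('banana', 9), ('cherry', 7)]
```

Because every input is sorted, the merge reads each segment sequentially and uses memory proportional to the number of segments, not their size.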

Storing data in sorted order, however, would normally break sequential writes. To address this, storage engines use self-balancing tree data structures (such as AVL trees) that allow keys to be inserted in any order while maintaining sorted traversal. The storage engine first writes incoming data to an in-memory self-balancing tree called a memtable. When the memtable reaches a configured size, its contents are flushed to disk as an SSTable.

Read operations first check the memtable, then the most recent SSTables on disk, followed by older SSTables if needed. To recover from crashes, the database also maintains an append-only log that can be replayed to reconstruct the memtable.
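Putting the memtable and SSTables together, a toy write/read path might look like this (a plain dict plus sorting stands in for the self-balancing tree, and SSTables are kept in memory for brevity):

```python
import bisect

class SimpleLSM:
    """Toy LSM write/read path: an in-memory memtable that is flushed as a
    sorted SSTable when it grows past a threshold. Reads check the memtable
    first, then SSTables from newest to oldest."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}           # real engines use a balanced tree
        self.sstables = []           # newest last; each is a sorted list
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):    # newest SSTable first
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)    # binary search: sorted keys
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None

db = SimpleLSM(memtable_limit=2)
db.set("a", 1)
db.set("b", 2)        # memtable reaches its limit and is flushed
db.set("a", 3)
print(db.get("a"), db.get("b"))  # 3 2
```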

The LSM-tree builds on earlier work on log-structured file systems, and its idea of maintaining and compacting sorted files was later adopted by database storage engines. These are commonly referred to as LSM-based storage engines, with examples including LevelDB, RocksDB, and the storage engines used in Cassandra and HBase.

Performance optimizations

  • LSM-trees can be slow when looking up keys that are not present in the database, since the search may need to scan multiple data structures. To address this, databases use Bloom filters, which can quickly determine whether a key does not exist in the database, thereby avoiding unnecessary disk reads.
  • There are different strategies for merging and compaction. The most common ones are:
      • Size-tiered compaction, which merges newer, smaller SSTables into larger, older SSTables.
      • Leveled compaction, which splits the key range into smaller SSTables and moves older data into separate levels, allowing compaction to proceed more incrementally.
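A minimal Bloom filter can be sketched as follows (the bit count, hash count, and use of SHA-256 are arbitrary illustrative choices):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions set k bits per added key.
    A zero bit at any probed position proves the key was never added;
    all ones means the key is *possibly* present (false positives allowed)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                # big integer used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user42")
print(bf.might_contain("user42"))   # True
print(bf.might_contain("user99"))   # False with very high probability
```

A "no" answer lets the storage engine skip the memtable and every SSTable for that key entirely.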

B-Trees

A B-Tree is the most commonly used data structure for indexing in both relational and NoSQL databases. Like SSTables, it stores sorted key–value pairs, but instead of using variable-sized segments, it organizes data into fixed-size blocks or pages (typically 4 KB).

This design aligns well with storage devices, which are also organized into fixed-size blocks. Each page can be identified by its disk address, allowing pages to reference one another directly.

Using this structure, a B-Tree index organizes indexed keys across multiple pages. Each page stores a contiguous range of keys along with references to child pages between keys, where each key defines the boundaries covered by the corresponding child page.

┌────┬─────┬────┬─────┬────┬─────┬────┬─────┬────┬─────┬────┐
│ref │ 100 │ref │ 200 │ref │ 300 │ref │ 400 │ref │ 500 │ref │
└─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┘
  │          │          │          │          │          └── key ≥ 500
  │          │          │          │          └───────────── 400 ≤ key < 500
  │          │          │          └──────────────────────── 300 ≤ key < 400
  │          │          └──────────────── 200 ≤ key < 300
  │          └── 100 ≤ key < 200              │
  └─── key < 100                              │
┌────┬─────┬────┬─────┬────┬─────┬────┬─────┬─┴──┬─────┬────┐
│ref │ 210 │ref │ 230 │ref │ 250 │ref │ 270 │ref │ 290 │ref │
└─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┘
 ...        ...        ...         │         ...        ...
                               250 ≤ key < 270
┌─────┬────┬─────┬────┬─────┬────┬─┴───┬────┬─────┬────┐
│ 250 │val │ 251 │val │ 252 │val │ 253 │val │ 254 │val │
└─────┴────┴─────┴────┴─────┴────┴─────┴────┴─────┴────┘
  • For searching a key, the lookup starts at the root node and follows the child reference that corresponds to the range in which the search key falls. This process continues down the tree until it reaches a leaf node/page, which either stores the value inline (within the leaf page) or contains a reference to an external page where the value is stored. The number of child page references from a single page is called the branching factor. It depends on the space required to store page references and key range boundaries, but it is typically on the order of hundreds.
  • To update an existing key, the corresponding leaf page is located, modified, and written back to disk.
  • To insert a new key, the value is added to the appropriate leaf page. If the leaf page, or any intermediate page, does not have enough free space to accommodate the new key, a page split occurs. This operation divides the full page into two half-full pages and recursively updates the references in the parent nodes.

    For example
    ┌────┬─────┬────┬─────┬────┬─────┬────┬──────────────────┐
    │ref │ 310 │ref │ 333 │ref │ 345 │ref │   (spare space)  │
    └─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┴──────────────────┘
     ...        ...         │         ...
                         333 ≤ key < 345
    ┌─────┬────┬─────┬────┬─┴───┬────┬─────┬────┬─────┬────┐
    │ 333 │val │ 335 │val │ 337 │val │ 340 │val │ 342 │val │
    └─────┴────┴─────┴────┴─────┴────┴─────┴────┴─────┴────┘
    
    ==> Insert key 334
    
    Parent node (key promoted after split)
    
    ┌────┬─────┬────┬─────┬────┬─────┬────┬─────┬────┬──────────┐
    │ref │ 310 │ref │ 333 │ref │ 337 │ref │ 345 │ref │ (spare)  │
    └─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┴─────┴─┬──┴──────────┘
     ...        ...         │          │         ...
                      333 ≤ key < 337  └──────────────────── 337 ≤ key < 345
                            │                                   │
    ┌─────┬────┬─────┬────┬─┴───┬────┬────────────────────┐     │
    │ 333 │val │ 334 │val │ 335 │val │   (spare space)    │     │
    └─────┴────┴─────┴────┴─────┴────┴────────────────────┘     │
    ┌─────┬────┬─────┬────┬─────┬────┬────────────────────┐     │
    │ 337 │val │ 340 │val │ 342 │val │   (spare space)    │─────┘
    └─────┴────┴─────┴────┴─────┴────┴────────────────────┘
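    The lookup procedure described above can be sketched as follows, using bisect to pick the child whose range contains the key and then to search within the leaf page (the node layout is simplified for illustration):

```python
import bisect

class Node:
    """Toy B-tree page: `keys` are range boundaries; an internal page has
    len(keys) + 1 `children`, a leaf page stores `values` inline."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children
        self.values = values

def search(node, key):
    while node.children is not None:             # descend internal pages
        i = bisect.bisect_right(node.keys, key)  # child whose range holds key
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)       # search within the leaf page
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

leaf1 = Node([250, 251, 252], values=["a", "b", "c"])
leaf2 = Node([270, 280], values=["d", "e"])
root = Node([270], children=[leaf1, leaf2])      # key >= 270 goes right
print(search(root, 252))  # c
print(search(root, 260))  # None
```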
    

Most databases can fit their indexes into a B-tree that is only 3-4 levels deep, which means only a small number of page references are required to locate a target page. For example, a four-level tree with 4 KB pages and a branching factor of 500 can store up to 256 TB of data.
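The capacity figure can be checked with quick arithmetic:

```python
branching_factor = 500
page_size_bytes = 4 * 1024          # 4 KB pages

# Each of the four levels multiplies the number of reachable pages by the
# branching factor, so the bottom level can address branching_factor**4 pages.
capacity_bytes = branching_factor ** 4 * page_size_bytes
print(capacity_bytes // 10**12, "TB")  # 256 TB
```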

Writing to a B-tree requires modifying existing pages, which can trigger additional writes due to page splits. This process is risky because a crash during modification can leave the index in a corrupted state. To prevent this, B-tree indexes rely on an additional data structure called a write-ahead log (WAL), also known as a redo log. The WAL records all modifications to the B-tree before they are applied to disk, allowing the database to recover to a consistent state after a crash.
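The WAL idea can be sketched as follows: every change is appended and flushed to the log before the page itself is touched, so a crash between the two steps can be repaired by replaying the log (the file name and record format are illustrative):

```python
import json
import os

def apply_write(wal_path, pages, page_id, record):
    """Append the intended change to the WAL (and force it to disk)
    *before* performing the in-place page update."""
    with open(wal_path, "a") as wal:
        wal.write(json.dumps({"page": page_id, "record": record}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())     # make sure the log entry is durable
    pages[page_id] = record        # now it is safe to modify the page

def recover(wal_path, pages):
    """After a crash, replay the WAL so pages reflect every logged change."""
    if os.path.exists(wal_path):
        with open(wal_path) as wal:
            for line in wal:
                entry = json.loads(line)
                pages[entry["page"]] = entry["record"]

pages = {}
apply_write("btree.wal", pages, 7, "key=42")
recovered = {}                     # imagine the page update itself was lost
recover("btree.wal", recovered)
print(recovered[7])                # key=42
```

Real engines also checkpoint and truncate the log; this sketch only shows the ordering guarantee that makes recovery possible.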

In addition, updating pages in place requires careful concurrency control to prevent the tree from entering an inconsistent state. This is typically handled using lightweight locks known as latches.

Optimizations over traditional B-trees:

  • Instead of overwriting pages in place, some databases use a copy-on-write scheme to avoid inconsistent states. This approach can also help support different isolation levels for concurrency control.
  • Packing more keys into a single page increases the branching factor, which reduces the number of tree levels and improves overall B-tree performance. Some databases use abbreviated or prefix-compressed keys to save space.
  • Additional pointers are often added to improve performance. For example, leaf pages may include pointers to neighboring leaf pages, enabling faster range scans without repeatedly traversing parent pages.

Comparing B-Trees and LSM-Trees

B-trees are typically considered faster for read operations, while LSM-trees generally provide better write performance. However, these characteristics are highly workload-dependent. A few key factors should be considered when comparing the performance of different storage engines:

Write Amplification: B-trees often write data multiple times: first to the write-ahead log (WAL) and then to the data pages on disk. In addition, even small updates may require rewriting an entire page. LSM-trees also incur multiple writes due to compaction and merging of SSTables. This process—where a single application-level write results in multiple writes at the storage level—is known as write amplification.

Write amplification is a critical metric, especially for write-heavy systems, where excessive disk bandwidth consumption can become a performance bottleneck. LSM-trees typically achieve higher write throughput because they rely on sequential writes to SSTables, which generally results in lower write amplification compared to random page updates.


Compression: B-tree pages usually contain unused space to accommodate future insertions without triggering page splits. In contrast, LSM-trees do not require such slack space and are periodically compacted to eliminate fragmentation, allowing them to use storage more efficiently than B-trees.

However, ongoing compaction can sometimes interfere with concurrent read and write workloads due to limited disk bandwidth. This effect is more noticeable at higher latency percentiles, which can negatively impact average response times and throughput. In such scenarios, B-trees often exhibit more predictable performance.

Disk bandwidth must be shared between incoming writes and background compaction. For systems with high write throughput, compaction must be carefully tuned; otherwise, it may fail to keep up with incoming writes, leading to an accumulation of unmerged SSTables. This, in turn, degrades read performance, as lookups must check an increasing number of SSTables.

Other Indexing Structures

Secondary Indexes: The key–value indexes discussed earlier are also known as primary key indexes in relational databases. A primary key uniquely identifies a row in a relational table, a document in a document database, or a vertex in a graph database. Other records in the database can refer to that row, document, or vertex using its primary key (or ID), and the index is used to resolve these references efficiently.

Secondary indexes can be constructed in a similar way to primary key indexes, with one important difference: secondary keys are not unique. This can be handled in two ways: either by storing a list of matching row identifiers as the index value, or by making each secondary key unique by appending a row identifier to it.
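The first strategy, storing a list of matching row identifiers per secondary key, can be sketched as follows (the table contents are made up):

```python
from collections import defaultdict

# Rows keyed by primary key; `city` is the non-unique column to index.
rows = {
    101: {"name": "alice", "city": "Pune"},
    102: {"name": "bob",   "city": "Delhi"},
    103: {"name": "carol", "city": "Pune"},
}

# Secondary index: each city maps to the list of matching primary keys.
city_index = defaultdict(list)
for row_id, row in rows.items():
    city_index[row["city"]].append(row_id)

print(city_index["Pune"])  # [101, 103]
```

The alternative strategy would instead store unique composite keys such as ("Pune", 101) and ("Pune", 103), which works well with the sorted index structures described earlier.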


Storing Values Within the Index: Indexes store keys used for lookup, and their associated values can either be the actual row data or a reference to the row. In the latter approach, rows are stored in a separate structure called a heap file, which does not require any particular ordering. This avoids duplicating data across multiple secondary indexes and is efficient for in-place updates. However, it introduces an extra I/O operation when fetching the row after an index lookup, which can negatively impact read performance.

To avoid this additional lookup, some databases store the row data directly within the index. Such indexes are known as clustered indexes.

A middle ground between these approaches is the covering index, which stores additional columns within the index so that the database can answer certain queries directly from the index. While this improves read performance, it increases storage usage and write overhead due to data duplication.


Multi-Column Indexes: So far, we have considered indexes that map a single key to a value. This is insufficient for queries involving multiple columns. The most common approach is a concatenated index, which combines multiple columns into a single composite key by appending one column to another. This allows efficient lookups when queries specify columns in left-to-right order, but it is ineffective when queries omit the leading columns.

Multidimensional indexes provide a more general solution for queries involving multiple dimensions and are particularly important for geospatial data. Standard B-tree and LSM-tree indexes cannot efficiently answer queries over latitude and longitude ranges. One approach is to map multidimensional coordinates to a single value using a space-filling curve and index it using a regular B-tree. More commonly, specialized spatial indexes such as R-trees are used.


Full-Text Search and Fuzzy Indexes: Full-text and fuzzy indexes are useful when searching for similar keys rather than exact matches, such as handling misspelled words. Full-text search engines typically support features like ignoring grammatical variations, handling synonyms, and tolerating a limited number of character edits to accommodate typos.


In-Memory Databases: As RAM becomes cheaper, many datasets can be stored entirely in memory, potentially distributed across multiple machines. Some in-memory databases (such as those used for caching) can tolerate data loss, while others aim to provide durability using techniques such as periodic snapshots to disk, write-ahead logging, replication, or even specialized hardware like battery-backed RAM.

Beyond performance benefits, in-memory databases also enable data models that are difficult to implement efficiently using disk-based indexes. For example, systems like Redis provide database-like interfaces to data structures such as priority queues and sets.

If non-volatile memory (NVM) technologies become more widely adopted, further changes to storage engine design will likely be required. While this remains an active area of research, it is an important trend to watch going forward.

Transaction Processing or Analytics?

In the early days of business data processing, a write to a database typically corresponded to a real-world commercial transaction—such as making a sale, placing an order with a supplier, or paying an employee’s salary. Because of this usage pattern, a group of reads and writes forming a single logical unit came to be called a transaction. This terminology persisted even as databases expanded far beyond their original scope.

A transaction does not necessarily imply full ACID guarantees; rather, it is often used to describe a category of databases designed for low-latency reads and writes, as opposed to periodic batch processing jobs.

Since these transactions store and retrieve various types of data based on user input, this access pattern became known as Online Transaction Processing (OLTP). In contrast, databases also began to be used for data analytics, which exhibit very different access patterns (1), primarily scanning large volumes of data to derive insights. This usage model is referred to as Online Analytical Processing (OLAP).

  1. OLAP queries typically read a small number of columns from a very large number of rows and return computed aggregate statistics rather than individual stored values.

The major differences between OLTP and OLAP are summarized below:

Property             | OLTP                                               | OLAP
---------------------|----------------------------------------------------|--------------------------------------------
Main read pattern    | Small number of records per query, fetched by key  | Aggregations over large numbers of records
Main write pattern   | Random-access, low-latency writes from user input  | Bulk imports (ETL) or event streams
Primarily used by    | End users or customers via applications            | Internal analysts for decision support
What data represents | Current state of data (point-in-time view)         | Historical data accumulated over time
Dataset size         | Gigabytes to terabytes                             | Terabytes to petabytes

Data Warehousing

OLTP systems are typically required to be highly available and to process transactions with low latency, as they are often critical to day-to-day business operations. Executing OLAP queries on the same OLTP database can significantly degrade the performance of concurrently executing transactions, since analytical queries are usually expensive and require scanning large portions of the dataset.

This led to the development of data warehouses, which are separate databases designed specifically for OLAP workloads. Data warehouses allow analytical queries to run without impacting OLTP operations. They maintain a read-only copy of data from one or more OLTP systems within an organization.

Data is transferred into a data warehouse using a process called Extract–Transform–Load (ETL). This process involves extracting data from OLTP databases (either through periodic batch exports or continuous data streams), transforming it into a cleaned and analysis-friendly schema, and then loading it into the data warehouse.

An additional advantage of separating analytical workloads is that the database can be optimized specifically for OLAP access patterns. Traditional indexing structures such as B-trees and LSM-trees perform poorly for analytical queries. Although both OLTP and OLAP systems may expose a similar SQL-based interface, their internal storage layouts and access patterns are fundamentally different.

Stars and Snowflakes: Schemas for Analytics

Most data warehouses use a fairly formulaic approach to model their internal data, known as a star schema, also referred to as dimensional modeling. At the center of this model is the fact table, whose rows represent events that occurred at a specific point in time. Some columns in the fact table store attributes associated with the event, while other columns are foreign keys that reference related tables called dimension tables. Dimension tables capture the who, what, where, when, how, and why of each event.

For example
erDiagram
    FACT_SALES {
        int sales_id PK
        date sale_date
        int product_key FK
        int customer_key FK
        int store_key FK
        int quantity_sold
        float sales_amount
    }

    DIM_PRODUCT {
        int product_key PK
        string sku
        string description
        string brand_name
        string category
        string fat_content
        string package_size
    }

    DIM_CUSTOMER {
        int customer_key PK
        string customer_name
        string gender
        int age
        string loyalty_status
    }

    DIM_STORE {
        int store_key PK
        string store_name
        string city
        string state
        string region
    }

    FACT_SALES }o--|| DIM_PRODUCT : "product_key"
    FACT_SALES }o--|| DIM_CUSTOMER : "customer_key"
    FACT_SALES }o--|| DIM_STORE : "store_key"

Sometimes even dates are stored in a dimension table; this makes it possible to attach additional metadata to each date, such as whether it was a holiday.

The term star schema comes from the way these table relationships appear when visualized: the fact table sits at the center, surrounded by dimension tables. A variation of this design is the snowflake schema, in which dimension tables are further normalized into sub-dimensions. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler and more intuitive for analysts to query and work with.

Column-Oriented Storage

Fact tables in large organizations can grow to trillions of rows, consuming petabytes of data. At this scale, efficiently querying them using OLTP-style storage engines becomes challenging. In OLTP systems, data is typically stored in a row-oriented layout, where all values from a single row are stored contiguously on disk.

Common OLAP access patterns involve selecting a small subset of columns (for example, 4–5 out of hundreds) and applying filters over large numbers of rows. When such queries are executed on row-oriented storage, the database must read entire rows from disk into memory, parse them, and then discard most of the data that is not needed for the query. This results in significant inefficiency due to excessive I/O and unnecessary data processing.

To address this problem, OLAP databases use column-oriented storage, where values from the same column across all rows are stored contiguously. When each column is stored separately, a query only needs to read and process the columns it actually references, significantly reducing I/O and computation.
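The difference can be illustrated with a toy table held in both layouts; a query that touches two columns never needs to read the third:

```python
# Row-oriented layout: each record stores all of its values together.
rows = [
    {"date": "2024-01-01", "product": "milk",  "amount": 120},
    {"date": "2024-01-01", "product": "bread", "amount": 90},
    {"date": "2024-01-02", "product": "milk",  "amount": 100},
]

# Column-oriented layout: one contiguous array per column; the i-th entry
# of every column belongs to the i-th row.
columns = {
    "date":    [r["date"] for r in rows],
    "product": [r["product"] for r in rows],
    "amount":  [r["amount"] for r in rows],
}

# SELECT SUM(amount) WHERE product = 'milk' touches only two of the
# three columns; the date column is never read.
total = sum(
    amount
    for product, amount in zip(columns["product"], columns["amount"])
    if product == "milk"
)
print(total)  # 220
```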

Column Compression

Column-oriented storage also enables highly effective compression because column values tend to be repetitive, which makes them well suited for compression techniques. Depending on the data characteristics, different compression methods can be applied. One technique that is particularly effective in data warehouse workloads is bitmap encoding.

In many cases, the number of distinct values in a column is small compared to the total number of rows. A column with n distinct values can be represented using n separate bitmaps. Each bitmap corresponds to a distinct value, and a set bit in the bitmap indicates that the row at that position contains the corresponding value. For example,

┌───────┬──────────────────┐
│ RowID │ product_category │
├───────┼──────────────────┤
│ 1     │ Dairy            │
│ 2     │ Snacks           │
│ 3     │ Dairy            │
│ 4     │ Beverages        │
│ 5     │ Dairy            │
│ 6     │ Snacks           │
└───────┴──────────────────┘
┌─────────────────┬───────────────────┐
│   RowID →       │   1  2  3  4  5  6│
├─────────────────┼───────────────────┤
│Bitmap[Dairy]    │   1  0  1  0  1  0│
│Bitmap[Snacks]   │   0  1  0  0  0  1│
│Bitmap[Beverages]│   0  0  0  1  0  0│
└─────────────────┴───────────────────┘

If n is very small, these bitmaps can be stored with one bit per row. But if n is larger, most of the bitmaps will contain long runs of zeros (they are sparse). In that case, the bitmaps can additionally be run-length encoded as follows

1 million rows, where Beverages appears only once:

Bitmap[Beverages]: 0 0 0 0 0 ... 0 1 0 0 ... 0

RLE              : (0,523456), (1,1), (0,476543)
                   523,456 zeros, one 1, 476,543 zeros
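Both encodings can be sketched in a few lines, using the product_category example above:

```python
def bitmaps(column):
    """One bitmap per distinct value: bit i is set iff row i holds that value."""
    result = {}
    for i, value in enumerate(column):
        result.setdefault(value, [0] * len(column))[i] = 1
    return result

def run_length_encode(bits):
    """Encode a bitmap as (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([bit, 1])   # start a new run
    return [tuple(run) for run in runs]

col = ["Dairy", "Snacks", "Dairy", "Beverages", "Dairy", "Snacks"]
maps = bitmaps(col)
print(maps["Dairy"])                         # [1, 0, 1, 0, 1, 0]
print(run_length_encode(maps["Beverages"]))  # [(0, 3), (1, 1), (0, 2)]
```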

Column Families

Cassandra and HBase have a concept of column families, which can misleadingly suggest that they are column-oriented. In fact, each column family stores all columns from a row together, along with a row key, so the storage model is still mostly row-oriented.


Memory Bandwidth and Vectorized Processing: A major bottleneck when scanning millions of values is the bandwidth required to move data from disk into memory and then into the CPU. OLAP databases are designed to make efficient use of memory bandwidth and CPU caches by minimizing branch mispredictions and leveraging single-instruction–multiple-data (SIMD) instructions available in modern CPUs.

In addition, CPUs can operate directly on compressed data using bitwise operations such as AND and OR through a technique known as vectorized processing. This approach is significantly faster than traditional execution models that rely on function calls and conditional branching.

Sort Order in Column Storage

Instead of storing rows in insertion order, column-oriented databases can impose a sort order, just as SSTables do. Columns cannot be sorted independently, however, as this would break the alignment of rows across columns: the k-th value in one column must belong to the same row as the k-th value in every other column. Instead, a primary sort key is chosen, most commonly a timestamp or date column, and any additional sort keys further order rows within groups that share the same value for the preceding keys.

Sorting provides a clear advantage by enabling much more effective compression, since similar values are stored sequentially. This compression benefit is strongest for the primary sort key and gradually diminishes for subsequent sort keys.

C-Store extension

Different queries benefit from different sort orders. Since data is often replicated across multiple machines anyway, the same dataset can be stored multiple times using different sort orders. This allows the query engine to choose the version that best matches a given query’s access pattern.

Writing to Column-Oriented Storage

Compression and sorting make in-place updates difficult in column-oriented storage. To address this, OLAP databases typically use an LSM-tree–style approach. Incoming writes are first applied to an in-memory store that maintains data in a sorted structure. Once enough data has accumulated, it is merged with the on-disk column files and written out as new files in bulk.

As a result, read queries must consider both the column data stored on disk and the more recent writes held in memory, combining the two to produce correct query results.

Aggregation: Data Cubes and Materialized Views

When multiple queries frequently aggregate the same values, it is beneficial to cache those results. One way to create such a cache is by using materialized views. In relational databases, a materialized view is defined similarly to a standard (virtual) view, which is a table-like object derived from a query.

The key difference is that a materialized view stores the query results as a physical copy on disk, whereas a virtual view is merely a query shortcut that is expanded and executed on the fly at query time.

Because materialized views store denormalized data, they must be updated whenever the underlying data changes, which adds overhead to write operations. For this reason, they are primarily used in read-heavy data warehouse workloads.

A common special case of a materialized view is a data cube, also known as an OLAP cube, which precomputes aggregates across selected dimensions.

2-Dimensional Data Cube

Each cell in the following materialized view represents an aggregate value computed by the function Sales(Product, Region).

                       Region →
         ┌───────────┬────────────────────┐
         │           │  East   West  South│
         ├───────────┼────────────────────┤
         │ Milk      │  120    100     80 │
Product ↓│ Bread     │   90    110     70 │
         │ Juice     │  150    130    120 │
         └───────────┴────────────────────┘

Using the same principle, higher-dimensional cubes can be constructed. The main advantage of a materialized data cube is that certain analytical queries become extremely fast because the results are precomputed. The trade-off is reduced flexibility, as data cubes cannot support the full range of queries that can be executed against the raw, detailed data.
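A 2-D cube like the one above can be sketched as a precomputed dictionary of aggregates (the sales figures mirror the table above):

```python
from collections import defaultdict

sales = [
    ("Milk", "East", 120), ("Milk", "West", 100), ("Milk", "South", 80),
    ("Bread", "East", 90), ("Bread", "West", 110), ("Bread", "South", 70),
]

# Precompute the cube once: one aggregate cell per (product, region) pair.
cube = defaultdict(int)
for product, region, amount in sales:
    cube[(product, region)] += amount

# Queries against precomputed cells are lookups or small sums, not scans
# of the raw sales records.
print(cube[("Milk", "East")])                                # 120
print(sum(v for (p, _), v in cube.items() if p == "Bread"))  # 270
```

Summing along one dimension of the cube (e.g. all regions for "Bread") is exactly the kind of query that becomes cheap, while queries over attributes not in the cube must fall back to the detailed data.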