The Data Lakehouse: Merging Data Lakes and Data Warehouses

The data lakehouse is an architectural pattern that has become increasingly prominent in data engineering over the past few years. It attempts to combine the best properties of two previously separate systems: data lakes (cheap, scalable, format-flexible storage) and data warehouses (fast queries, transactions, governance).

What came before: lakes and warehouses

Data warehouses like Redshift, Snowflake, and BigQuery provide fast analytical queries, ACID transactions, and strong schema enforcement. They are excellent for analytics but expensive for storing large amounts of raw or semi-structured data.

Data lakes — typically object storage like S3 or GCS — are cheap and can store any format (JSON, CSV, Parquet, images, logs). But raw object storage provides no query capability, no schema enforcement, and no transactional guarantees.

The two-tier architecture — raw data in a lake, cleaned data in a warehouse — became the standard for data teams. But it creates duplication, synchronization complexity, and the "data swamp" problem: lakes with no governance become ungovernable.

What the lakehouse adds

The lakehouse sits on top of object storage but adds a table format layer that provides:

ACID transactions: Multiple writers can modify the same table without corruption
Time travel: Query the table as it existed at any point in the past
Schema evolution: Add, rename, or change column types without rewriting data
Metadata layer: Track what data exists, in what files, with what statistics — enabling fast query planning

The three dominant open table formats implementing this are:

Delta Lake: Developed by Databricks, widely adopted
Apache Iceberg: Open format, strong vendor support from AWS and others
Apache Hudi: Focused on incremental processing and CDC (change data capture)

How Delta Lake works

Delta Lake is the clearest example to understand. It stores data in Parquet files (a columnar format) in object storage, with a _delta_log directory alongside that records every transaction as a JSON file.

When you write to a Delta table:

New data is written as new Parquet files
A new log entry is added recording what files were added, removed, or changed

When you read from a Delta table:

The transaction log is read to determine the current state of the table
Only the active files are read

This design gives you transactions (log entries are atomic), time travel (read the log up to any point in history), and metadata-based query optimization.

Query engines on top

The table format layer is not a query engine — it is metadata. Actual query execution is handled by separate engines:

Apache Spark: The most common engine for large-scale transformations
Trino / Presto: Distributed SQL engines that can query across many data sources
DuckDB: Embedded analytical database, excellent for single-machine analytics on medium-size datasets
Databricks SQL / Athena / BigQuery: Managed query services that can read open formats

This separation is an advantage: you can use the best engine for each task without being locked into a single vendor for storage.

The governance layer

A lakehouse still needs governance — knowing what data exists, what it means, who can access it, and how it flows. This is typically provided by data catalog tools (Apache Atlas, DataHub, AWS Glue Catalog) that track metadata about tables, schemas, lineage, and access controls.

Without governance, even a well-structured lakehouse becomes difficult to navigate as it grows.

When to use a lakehouse

A lakehouse architecture makes sense when:

You have diverse data types that would be expensive to warehouse (raw logs, unstructured data, large files)
You need to query historical snapshots of data
You want a single storage tier for both raw and processed data
You are already using Spark or similar distributed compute

A traditional data warehouse is simpler when:

Your data is well-structured and volume is manageable
Your team is primarily SQL-based
Operational simplicity matters more than storage cost optimization

Summary

The data lakehouse adds an open table format layer (Delta Lake, Iceberg, Hudi) on top of cheap object storage, providing ACID transactions, time travel, and schema evolution without migrating data to an expensive warehouse. Query engines connect to the table format to execute SQL. This architecture enables teams to maintain a single storage tier for raw and processed data, reducing duplication while preserving the cost advantages of object storage.

The Data Lakehouse: Merging Data Lakes and Data Warehouses

What came before: lakes and warehouses

What the lakehouse adds

How Delta Lake works

Query engines on top

The governance layer

When to use a lakehouse

Summary

More Intelligence

Data Pipelines for Software Engineers

Vector Databases: What They Are and Why They Matter for AI

The Orbital Economy: Who Controls the Pipes in Space