The data lakehouse is an architectural pattern that has become increasingly prominent in data engineering over the past few years. It attempts to combine the best properties of two previously separate systems: data lakes (cheap, scalable, format-flexible storage) and data warehouses (fast queries, transactions, governance).
What came before: lakes and warehouses
Data warehouses like Redshift, Snowflake, and BigQuery provide fast analytical queries, ACID transactions, and strong schema enforcement. They are excellent for analytics but expensive for storing large amounts of raw or semi-structured data.
Data lakes — typically object storage like S3 or GCS — are cheap and can store any format (JSON, CSV, Parquet, images, logs). But raw object storage provides no query capability, no schema enforcement, and no transactional guarantees.
The two-tier architecture — raw data in a lake, cleaned data in a warehouse — became the standard for data teams. But it creates duplication, synchronization complexity, and the "data swamp" problem: lakes with no governance become ungovernable.
What the lakehouse adds
The lakehouse sits on top of object storage but adds a table format layer that provides:
- ACID transactions: Multiple writers can modify the same table without corruption
- Time travel: Query the table as it existed at any point in the past
- Schema evolution: Add, rename, or change column types without rewriting data
- Metadata layer: Track what data exists, in what files, with what statistics — enabling fast query planning
The three dominant open table formats implementing this are:
- Delta Lake: Developed by Databricks, widely adopted
- Apache Iceberg: Open format, strong vendor support from AWS and others
- Apache Hudi: Focused on incremental processing and CDC (change data capture)
How Delta Lake works
Delta Lake is the clearest example to understand. It stores data in Parquet files (a columnar format) in object storage, with a _delta_log directory alongside that records every transaction as a JSON file.
When you write to a Delta table:
- New data is written as new Parquet files
- A new log entry is added recording what files were added, removed, or changed
When you read from a Delta table:
- The transaction log is read to determine the current state of the table
- Only the active files are read
This design gives you transactions (log entries are atomic), time travel (read the log up to any point in history), and metadata-based query optimization.
Query engines on top
The table format layer is not a query engine — it is metadata. Actual query execution is handled by separate engines:
- Apache Spark: The most common engine for large-scale transformations
- Trino / Presto: Distributed SQL engines that can query across many data sources
- DuckDB: Embedded analytical database, excellent for single-machine analytics on medium-size datasets
- Databricks SQL / Athena / BigQuery: Managed query services that can read open formats
This separation is an advantage: you can use the best engine for each task without being locked into a single vendor for storage.
The governance layer
A lakehouse still needs governance — knowing what data exists, what it means, who can access it, and how it flows. This is typically provided by data catalog tools (Apache Atlas, DataHub, AWS Glue Catalog) that track metadata about tables, schemas, lineage, and access controls.
Without governance, even a well-structured lakehouse becomes difficult to navigate as it grows.
When to use a lakehouse
A lakehouse architecture makes sense when:
- You have diverse data types that would be expensive to warehouse (raw logs, unstructured data, large files)
- You need to query historical snapshots of data
- You want a single storage tier for both raw and processed data
- You are already using Spark or similar distributed compute
A traditional data warehouse is simpler when:
- Your data is well-structured and volume is manageable
- Your team is primarily SQL-based
- Operational simplicity matters more than storage cost optimization
Summary
The data lakehouse adds an open table format layer (Delta Lake, Iceberg, Hudi) on top of cheap object storage, providing ACID transactions, time travel, and schema evolution without migrating data to an expensive warehouse. Query engines connect to the table format to execute SQL. This architecture enables teams to maintain a single storage tier for raw and processed data, reducing duplication while preserving the cost advantages of object storage.