Skip to main content
Back to IntelligenceData Engineering

The Data Lakehouse: Merging Data Lakes and Data Warehouses

The data lakehouse architecture combines cheap object storage with the performance and governance of data warehouses. Here is how it works.

E
Explicor
4 min read

The data lakehouse is an architectural pattern that has become increasingly prominent in data engineering over the past few years. It attempts to combine the best properties of two previously separate systems: data lakes (cheap, scalable, format-flexible storage) and data warehouses (fast queries, transactions, governance).

What came before: lakes and warehouses

Data warehouses like Redshift, Snowflake, and BigQuery provide fast analytical queries, ACID transactions, and strong schema enforcement. They are excellent for analytics but expensive for storing large amounts of raw or semi-structured data.

Data lakes — typically object storage like S3 or GCS — are cheap and can store any format (JSON, CSV, Parquet, images, logs). But raw object storage provides no query capability, no schema enforcement, and no transactional guarantees.

The two-tier architecture — raw data in a lake, cleaned data in a warehouse — became the standard for data teams. But it creates duplication, synchronization complexity, and the "data swamp" problem: lakes with no governance become ungovernable.

What the lakehouse adds

The lakehouse sits on top of object storage but adds a table format layer that provides:

  • ACID transactions: Multiple writers can modify the same table without corruption
  • Time travel: Query the table as it existed at any point in the past
  • Schema evolution: Add, rename, or change column types without rewriting data
  • Metadata layer: Track what data exists, in what files, with what statistics — enabling fast query planning

The three dominant open table formats implementing this are:

  • Delta Lake: Developed by Databricks, widely adopted
  • Apache Iceberg: Open format, strong vendor support from AWS and others
  • Apache Hudi: Focused on incremental processing and CDC (change data capture)

How Delta Lake works

Delta Lake is the clearest example to understand. It stores data in Parquet files (a columnar format) in object storage, with a _delta_log directory alongside that records every transaction as a JSON file.

When you write to a Delta table:

  1. New data is written as new Parquet files
  2. A new log entry is added recording what files were added, removed, or changed

When you read from a Delta table:

  1. The transaction log is read to determine the current state of the table
  2. Only the active files are read

This design gives you transactions (log entries are atomic), time travel (read the log up to any point in history), and metadata-based query optimization.

Query engines on top

The table format layer is not a query engine — it is metadata. Actual query execution is handled by separate engines:

  • Apache Spark: The most common engine for large-scale transformations
  • Trino / Presto: Distributed SQL engines that can query across many data sources
  • DuckDB: Embedded analytical database, excellent for single-machine analytics on medium-size datasets
  • Databricks SQL / Athena / BigQuery: Managed query services that can read open formats

This separation is an advantage: you can use the best engine for each task without being locked into a single vendor for storage.

The governance layer

A lakehouse still needs governance — knowing what data exists, what it means, who can access it, and how it flows. This is typically provided by data catalog tools (Apache Atlas, DataHub, AWS Glue Catalog) that track metadata about tables, schemas, lineage, and access controls.

Without governance, even a well-structured lakehouse becomes difficult to navigate as it grows.

When to use a lakehouse

A lakehouse architecture makes sense when:

  • You have diverse data types that would be expensive to warehouse (raw logs, unstructured data, large files)
  • You need to query historical snapshots of data
  • You want a single storage tier for both raw and processed data
  • You are already using Spark or similar distributed compute

A traditional data warehouse is simpler when:

  • Your data is well-structured and volume is manageable
  • Your team is primarily SQL-based
  • Operational simplicity matters more than storage cost optimization

Summary

The data lakehouse adds an open table format layer (Delta Lake, Iceberg, Hudi) on top of cheap object storage, providing ACID transactions, time travel, and schema evolution without migrating data to an expensive warehouse. Query engines connect to the table format to execute SQL. This architecture enables teams to maintain a single storage tier for raw and processed data, reducing duplication while preserving the cost advantages of object storage.

More Intelligence

Data Engineering

Data Pipelines for Software Engineers

What data pipelines are, how they are structured, and the key decisions involved in building them — from batch to streaming to hybrid approaches.

5 min
Space

The Orbital Economy: Who Controls the Pipes in Space

Launch costs have collapsed 95% since 2010. A new orbital economy is emerging around the infrastructure layer — satellites, spectrum, and the companies building the highways of space.

4 min