Have you ever spun up a Spark cluster just to update three rows in a Delta table? Hi everyone, I'm Romain. For the next 30 minutes, we're diving into Data Lakehouse workflows: pure Python, no Spark required.
So what exactly is a Data Lakehouse, and why should you care? Have you ever struggled with data lakes becoming data swamps? The lakehouse concept bridges that gap between Data Lakes and Data Warehouses by:
- Keeping the flexibility of data lakes
- Bringing the consistency guarantees of data warehouses
- Decoupling storage and compute for cost optimization with cloud object storage like S3 or ADLS
And here's the best part: data lakehouse formats are all open table formats. You've probably heard of Delta Lake from Databricks, Iceberg from Netflix, Hudi from Uber, or even the latest one in town, DuckLake from DuckDB.
Here is a recap diagram from Databricks showcasing the evolution of data analytics architectures over time:
- from data warehouses, around since the 1980s,
- to the emergence of data lakes in the 2010s, with machine learning use cases in mind (which need file-level access),
- up to the data lakehouse unification in the 2020s.
TODO: - diagram with parquet files + metadata on storage, query engine on compute (cloud, local)
So what do you actually get with a lakehouse? Let me highlight the key features that solve those data swamp problems by combining the best of both worlds from data lakes and warehouses. First, you get the performance and cost benefits: scalable storage and compute that you can scale independently, no more paying for idle clusters. Second, you get data reliability: ACID transactions mean no more corrupted data from failed jobs, plus schema enforcement so your data stays consistent. And finally, you get operational flexibility: time-travel for debugging, snapshots for experimentation. Features that make your life as a data engineer or data scientist much easier. These aren't just buzzwords: they're practical solutions to real problems you face every day.
Let me show you what a Data Lakehouse format actually looks like under the hood: we'll take Delta Lake as example. You've got your data in Parquet files: think "CSV but columnar, compressed, and with data types." The magic is in the _delta_log directory: JSON files that track every change. This transaction log tells query engines exactly which Parquet files to read for any given version. This simple combination gives you ACID transactions, time travel, and all those features we just discussed. No complex infrastructure: just files on storage that any query engine can work with. TODO (where/when?): illustrate/demo parquet files content and transaction log being created when performing table creation, data append, data merge, data overwrite, row deletion operations to better understand how it works
Now here's the thing: Spark is an amazing technology. It has been the de facto distributed system for data workloads for the past 10 years, but it comes with a cost. It's designed for massive workloads: we're talking terabytes and above. But how often are you actually processing that much data? More often, you're dealing with cluster setup, JVM tuning, and complex local development just to update a few hundred rows. And here's the kicker: even when your storage is big, your actual query might only touch a tiny subset of that data. So why not start simple? Get the storage benefits of lakehouse formats without the operational complexity. You can always scale up to Spark later when you actually need it.
So how do you actually get started? The Python ecosystem has exploded with lakehouse support in the past few years. For each format, you've got choices: low-level table management with libraries like `deltalake` and `pyiceberg`, or higher-level dataframe operations with `polars`. But here's what's exciting: `duckdb` has partial support for Delta and Iceberg, and of course complete support for DuckLake. However, as we'll see, you can make it work with any table format with a simple trick. Single-node performance that rivals distributed systems, and it can read every major lakehouse format. We'll focus on Delta Lake today since it's the easiest to get started with, but everything I show you can be adapted to the other formats. The key point here? In 2025, there's no reason Spark has to be your default choice for lakehouse workloads in Python.
Now let's get hands-on with a simple example using weather data. We'll walk through the basics: creating tables, writing data, performing merges, and exploring Delta's time travel features. Everything runs locally with just Python - no cluster setup required.
Let's start simple: creating a Delta table is just defining a schema and pointing to a directory. Notice we're using proper data types: timestamp for time series data, string for categorical data, and float for measurements. The deltalake library handles all the Delta Lake protocol details for us.
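Here's a minimal sketch of that step with the `deltalake` library (the local `weather` directory and the exact column names are my own choices for this walkthrough):

```python
import pyarrow as pa
from deltalake import DeltaTable

# Hypothetical schema for the weather example: a timestamp, a city name,
# and a temperature measurement
schema = pa.schema(
    [
        ("time", pa.timestamp("us")),
        ("city", pa.string()),
        ("temperature", pa.float64()),
    ]
)

# Create an empty Delta table in a local directory: this only writes
# the _delta_log transaction log, no data files yet
dt = DeltaTable.create("weather", schema=schema)
```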
This is what gets created: just a directory with a _delta_log folder containing the transaction log. That first JSON file is the transaction data for the initial table version, tracking our table's history. No data files yet since we haven't written anything.
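If you list the directory at this point, you should see something like this (a sketch assuming the local `weather` table from above):

```python
from pathlib import Path

# List everything under the freshly created table directory
for path in sorted(Path("weather").rglob("*")):
    print(path)

# Expected output, roughly:
# weather/_delta_log
# weather/_delta_log/00000000000000000000.json
```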
Every Delta table has associated metadata. Notice the autogenerated unique ID and creation timestamp. The empty list of partition columns means we're not partitioning our data yet, which is fine for small datasets.
The table schema is compatible with Arrow format, which gives us broad ecosystem compatibility. Notice how our Python types map to Arrow types.
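Both pieces of information are one method call away (a sketch against the same hypothetical `weather` table):

```python
from deltalake import DeltaTable

dt = DeltaTable("weather")

# Table metadata: autogenerated id, creation timestamp, partition columns, configuration
print(dt.metadata())

# Table schema, exposed through the Arrow type system
print(dt.schema())
```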
Writing data to a table is straightforward: create a pandas DataFrame (or one from any other DataFrame library) and use the write_deltalake utility function. mode='append' ensures we don't overwrite existing data; mode='overwrite' replaces the whole content of the table instead. Behind the scenes, this creates Parquet files and updates the transaction log atomically.
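A minimal sketch of that append (the sample rows and the "Paris" values are made up for the demo):

```python
import pandas as pd
from deltalake import write_deltalake

# A couple of hypothetical temperature forecasts
df = pd.DataFrame(
    {
        # Delta stores microsecond timestamps, so cast from pandas' default nanoseconds
        "time": pd.to_datetime(["2025-09-01 00:00", "2025-09-01 01:00"]).astype("datetime64[us]"),
        "city": ["Paris", "Paris"],
        "temperature": [18.5, 17.9],
    }
)

# Append new rows; mode="overwrite" would replace the table content instead
write_deltalake("weather", df, mode="append")
```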
Now we see the power of Delta: a new transaction log entry and one Parquet file containing our data. Every write operation is tracked, giving us full audit trail and enabling time travel.
Here's another great feature of Delta: the merge operation, which lets you perform deduplication or upserts, for instance. Here, we're updating existing temperature forecasts where time and city match, and inserting new ones. This would be complex with raw Parquet files, but Delta handles it elegantly.
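In code, a merge with `deltalake` looks roughly like this (the predicate on `time` and `city` follows the example; the revised forecast values are made up):

```python
import pandas as pd
from deltalake import DeltaTable

# Hypothetical revised forecasts: one row updates an existing (time, city) pair,
# the other one is brand new
updates = pd.DataFrame(
    {
        "time": pd.to_datetime(["2025-09-01 01:00", "2025-09-01 02:00"]).astype("datetime64[us]"),
        "city": ["Paris", "Paris"],
        "temperature": [18.2, 16.4],
    }
)

dt = DeltaTable("weather")
(
    dt.merge(
        source=updates,
        predicate="target.time = source.time AND target.city = source.city",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()      # update forecasts that already exist
    .when_not_matched_insert_all()  # insert the new ones
    .execute()
)
```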
After the merge, we have three Parquet files and three transaction log entries. We'll see in a minute why two new files were created instead of one. This might seem like a lot of files, but Delta's query planner knows exactly which files contain what data, so queries stay efficient.
Let's peek inside the Parquet files to understand what happened. The first Parquet file holds the initial data from the 'append' write. The upsert / merge operation then created two new files rather than modifying the existing one:
- The first one contains the updated rows
- The second one contains the inserted rows
This is key to Delta's ACID guarantees: immutable files mean concurrent readers are never blocked.
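One way to peek at those files yourself (a sketch against the hypothetical `weather` table; `file_uris()` lists the Parquet files backing the current table version):

```python
import pyarrow.parquet as pq
from deltalake import DeltaTable

dt = DeltaTable("weather")

# Print the content of every Parquet file referenced by the latest version
for uri in dt.file_uris():
    print(uri)
    print(pq.read_table(uri).to_pandas())
```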
Reading a table is rather simple: Delta automatically reads from all relevant Parquet files and gives you the latest view. Notice we're at version 2 now, and the data reflects our merge operation with updated temperatures. The DeltaTable object also lets you specify filter conditions and column selections to avoid reading through all Parquet files (this is what we call file skipping, predicate pushdown, and column pruning).
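A sketch of reading the latest version with column pruning and a pushed-down filter (the "Paris" filter value is hypothetical):

```python
from deltalake import DeltaTable

dt = DeltaTable("weather")
print(dt.version())  # 2 at this point in the walkthrough

# Only the requested columns and the files matching the filter are read
df = dt.to_pandas(
    columns=["time", "temperature"],
    filters=[("city", "=", "Paris")],
)
```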
Being able to retrieve the table history is gold for debugging and auditing changes over time. Every operation is tracked with metrics: how many files were touched, execution time, rows affected. This operational metadata would take significant engineering effort to implement from scratch. This is the initial table version (version 0) with the table creation metadata.
This is the second table version (version 1) with the data append operation.
This is the third table version (version 2) with the data merge operation.
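Pulling that history programmatically is a short loop with `deltalake` (a sketch; each history entry is a dict of commit metadata, newest first):

```python
from deltalake import DeltaTable

dt = DeltaTable("weather")

# Version number, operation name, and operation metrics for each commit
for entry in dt.history():
    print(entry.get("version"), entry.get("operation"), entry.get("operationMetrics"))
```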
Time travel is one of Delta's killer features. Load any previous version for debugging, data recovery, or reproducible analysis. As expected, version 0 is empty (just after table creation), version 1 has our initial data, and version 2 includes the merge. This is why Delta keeps all those transaction logs.
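Loading a past version is just a constructor argument (a sketch against the same hypothetical `weather` table):

```python
from deltalake import DeltaTable

# Pin the table to a specific historical version
v0 = DeltaTable("weather", version=0).to_pandas()  # empty: right after table creation
v1 = DeltaTable("weather", version=1).to_pandas()  # the initial append
v2 = DeltaTable("weather", version=2).to_pandas()  # after the merge
```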
Now that we have explored how to get started with the deltalake Python package, let's switch to DuckDB. DuckDB brings SQL analytics to your Python lakehouse workflow: fast queries thanks to efficient multithreading, simple configuration, and native Delta Lake support. It's the perfect complement to deltalake for data exploration and analysis.
Here's where DuckDB shines: direct Delta table scanning with zero setup. Just point DuckDB at your Delta table directory and query it like any database. Notice we get the same data we created earlier, but now we can use full SQL. No ETL, no data movement, just pure analytical power. Please note DuckDB still has many restrictions regarding lakehouse formats as of September 2025:
- Only read operations, no writes
- Time-travel operations are still under development
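Here's a minimal sketch of that direct scan with DuckDB's delta extension (the path points at the same hypothetical local `weather` table, and the aggregation is just an example):

```python
import duckdb

con = duckdb.connect()
con.install_extension("delta")  # fetch the Delta extension if not already present
con.load_extension("delta")

# Query the Delta table directory directly with SQL: no ETL, no data movement
con.sql("""
    SELECT city, avg(temperature) AS avg_temperature
    FROM delta_scan('./weather')
    GROUP BY city
""").show()
```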
But the real magic is the interoperability. Take your Delta table from deltalake, convert it to a PyArrow dataset, and register it in DuckDB. And you can also go the other way around: run complex queries against Delta tables in DuckDB, export the results as Arrow record batches, and stream them into the write_deltalake function! This way you can build out-of-core data pipelines for larger datasets. Now you have the best of both worlds: Delta's ACID operations for writes, DuckDB's blazing-fast analytics for reads. This is the lakehouse promise delivered: one dataset, multiple engines, all without Spark complexity.
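A sketch of that round trip, assuming the hypothetical `weather` table from before and a made-up `daily_weather` output table:

```python
import duckdb
from deltalake import DeltaTable, write_deltalake

con = duckdb.connect()

# Delta -> DuckDB: expose the Delta table as a PyArrow dataset and register it
weather_ds = DeltaTable("weather").to_pyarrow_dataset()
con.register("weather", weather_ds)

# DuckDB -> Delta: run an analytical query and stream the result back
# as Arrow record batches, without materializing everything in memory
con.execute("""
    SELECT city, date_trunc('day', "time") AS day, avg(temperature) AS avg_temperature
    FROM weather
    GROUP BY city, day
""")
reader = con.fetch_record_batch()
write_deltalake("daily_weather", reader, mode="overwrite")
```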
Last Python tool for lakehouse formats: let me introduce a project I have been working on for the past few months, "Laketower". I am developing it to scratch my own itch, as most high-level data applications do not support lakehouse formats. It is an open-source, Apache-2.0 licensed Python utility providing a CLI and a web application to explore and manage lakehouse tables. It is local-first and compatible with both locally and remotely stored tables. It only supports Delta Lake for now, but other formats such as Iceberg and DuckLake will come. The easiest way to start using it: "uvx laketower" (you can of course "pip install laketower").
So here's the takeaway: you don't need to over-engineer your data stack. Start with Python lakehouse libraries for your small to medium workloads. Get all the benefits: ACID transactions, time travel, schema enforcement, without the operational burden. The ecosystem is mature, the tools are ready, and you can start today. And when you actually need distributed processing? That's when you graduate to Spark. But start simple. Your future self will thank you for not spinning up clusters to update three rows.
Anticipated questions:
"How to connect with Delta tables on Databricks?"
- `deltalake` now provides native Unity Catalog API integration to obtain the object storage URI and a temporary access token
- But it requires special permissions to be enabled in the Databricks workspace regarding "credential vending"
"When is the tipping point where I need to switch to Spark?"
- A good empirical rule of thumb is around 100GB on a single node
- Some studies on the internet show DuckDB and Polars can still handle this range on a beefy machine
- But if data is continuously growing and you need to process part or all of it, Spark might be wiser / more future-proof in that case
"Is it production-ready?"
- "It depends" on your requirements
- Lower-level libraries: yes, they are mature and support most operations, but they are still not as feature-complete as their Spark implementation counterparts (e.g. `deltalake` does not support "deletion vectors" as of Sept 2025)
- DuckDB: pretty great performance-wise, but it only supports read operations for Delta and Iceberg, and time travel is only partially implemented as of today
"Updating tables will create many Parquet files, which is not efficient for query engines. How to fix it?"
- Use the "data compaction" optimization
- It will only mark the old data files as "removed" in the transaction log, not physically remove them
- To remove those "tombstoned" files, you need the "data vacuuming" operation
"What about data retention?"
- Use the "data vacuuming" operation to remove old files past a given retention period, as well as those marked for removal (after a compaction operation for instance)
"Why create many Parquet files instead of updating them?"
- Parquet is an immutable file format
"Can we have primary keys?"
- The available constraints are NOT NULL and CHECK
"What about out-of-core but still medium-sized data processing?"
- You can read and write data in batches with deltalake and duckdb
- It usually requires using the PyArrow Dataset format in between, and it works very well
"Can I have access to the feed of data changes in a table?"
- Yes, using the Change Data Feed (CDF)
- It needs to be enabled at table creation using the "delta.enableChangeDataFeed" property
"Any performance benchmark of Spark vs Delta-rs or DuckDB?"
- I have not done it myself
- But there is a 2025 showdown by Miles Cole available: https://milescole.dev/data-engineering/2025/06/30/Spark-v-DuckDb-v-Polars-v-Daft-Revisited.html
- The conclusion: starting at around 100GB to process, Spark is still king; otherwise use Delta-rs, Polars and/or DuckDB for cost reduction
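Since compaction and vacuuming come up in several of these answers, here's a minimal sketch of both maintenance operations with `deltalake` (against the hypothetical `weather` table; the retention period is an arbitrary example):

```python
from deltalake import DeltaTable

dt = DeltaTable("weather")

# Data compaction: rewrite many small Parquet files into fewer larger ones;
# the old files are only tombstoned in the transaction log, not deleted
print(dt.optimize.compact())

# Data vacuuming: physically delete tombstoned files older than the retention period
print(dt.vacuum(retention_hours=168, dry_run=False, enforce_retention_duration=True))
```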