Kill the Small Files Nightmare: Modernizing Data Lakehouses with Apache Iceberg

Database tutorial - IT technology blog
Database tutorial - IT technology blog

The 2 AM PagerDuty Incident

Back in 2021, I spent a long night staring at a production disaster. A Spark job had partially overwritten a critical S3 partition. Because traditional data lakes treat folders as partitions, we had no ‘undo’ button. Restoring from a cold backup was our only hope. We had multiple writers hitting the same prefix, and without ACID compliance, our ‘source of truth’ was effectively a coin toss.

Our data team was losing 16 hours every single week just manually fixing corrupted partitions or troubleshooting performance lags caused by ‘small file’ syndrome. We were running a standard Data Lake on AWS S3 using the Hive table format. It worked during our early days. However, once we hit the petabyte scale, the cracks became canyons.

Why Traditional Data Lakes Break at Scale

S3 isn’t the problem; the Hive table format is. In the Hive world, a table is just a collection of directories. Want to change a column type? You usually have to rewrite the entire dataset. Need to delete one row for a GDPR request? You must read the whole file, filter it, and write it back to disk.

The technical debt usually shows up in three ways:

  • Directory-based partitioning: If a late record arrives with an event_date from 2019, the system struggles to place it or creates yet another tiny, inefficient directory.
  • Zero Atomicity: A failed write leaves ‘orphan’ files. Subsequent queries pick up this garbage, leading to silent data corruption.
  • Schema Rigidity: Renaming user_id to customer_uuid used to be a week-long migration. Most teams just give up and live with confusing column names.

The Contenders: Hive vs. Delta vs. Iceberg

Three candidates emerged when we looked for a way out. Hive was the legacy we needed to dump. Delta Lake was powerful but felt too locked into the Databricks ecosystem at the time. Then we discovered Apache Iceberg.

Originally built at Netflix to handle their massive scale, Iceberg ignores directory structures entirely. Instead, it tracks every individual data file in a ‘manifest file.’ This architectural pivot is huge. It brings the reliability of a SQL database to the cheap, infinite storage of S3.

One lesson I’ve learned from migrating hundreds of tables: your source data is almost never clean. When I need to quickly pivot CSV samples to JSON for schema testing or mapping, I use toolcraft.app/en/tools/data/csv-to-json. It runs entirely in the browser, meaning sensitive fields never leave your machine.

Building the Modern Lakehouse

If you are starting today, Spark with Iceberg is the gold standard. It’s mature and well-documented. Here is how I configure a session to handle a migration strategy.

1. Setting Up the Spark Session

from pyspark.sql import SparkSession

# Pointing Spark to the Iceberg Catalog
spark = SparkSession.builder \
    .appName("IcebergMigration") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg_warehouse") \
    .getOrCreate()

2. Hidden Partitioning: The Killer Feature

Iceberg features Hidden Partitioning. You don’t have to manually create a day column from a timestamp. The engine handles the transformation under the hood. No more redundant columns.

# Create a table partitioned by day without adding extra columns
spark.sql("""
CREATE TABLE local.db.events (
    id bigint,
    event_time timestamp,
    level string,
    message string
)
USING iceberg
PARTITIONED BY (days(event_time))
""")

Evolution Without the Headache

In the old days, renaming a column meant a 12-hour migration job. With Iceberg, it’s a metadata-only change. It takes milliseconds because no data files are touched.

-- Instant changes, zero downtime
ALTER TABLE local.db.events RENAME COLUMN message TO event_details;
ALTER TABLE local.db.events ADD COLUMN correlation_id string;

This metadata layer maps old data files to the new schema on the fly. This single feature saved our team dozens of hours of grunt work last year.

Time Travel: The Debugging Superpower

Imagine a stakeholder asks, “Why did the revenue look different yesterday?” In a traditional lake, you’re guessing. With Iceberg, you can query the table exactly as it existed at 2:00 PM last Tuesday.

-- View the history of snapshots
SELECT * FROM local.db.events.snapshots;

-- Travel back in time
SELECT * FROM local.db.events FOR TIMESTAMP AS OF (current_timestamp() - INTERVAL 2 DAYS);

This is effectively git revert for petabytes. If a batch job writes 50,000 corrupt records, you don’t hunt for files. You just roll back to the previous snapshot.

The Pragmatic Path Forward

Don’t try to migrate everything at once. Start with your most ‘unreliable’ table—the one that breaks your pipelines most often. Run a ‘shadow migration’ where you write to both Hive and Iceberg for seven days to verify the counts match.

Keep your performance sharp with a simple maintenance routine. Iceberg creates many small metadata files, so you need to prune them periodically:

# Expire old snapshots to reclaim S3 space
spark.sql("CALL local.system.expire_snapshots('db.events', timestamp '2026-05-20 00:00:00')")

# Compact tiny files into efficient 128MB chunks
spark.sql("CALL local.system.rewrite_data_files('db.events')")

Moving to Iceberg isn’t just a technical upgrade; it’s about reclaiming your weekends. My on-call rotation is significantly quieter now that we have ACID guarantees. If you’re still wrestling with broken S3 partitions, it’s time to stop the bleeding.

Share: