Organizations are increasingly looking to Amazon S3 as the foundation of an open data lakehouse, providing an alternative to expensive proprietary warehouses like Snowflake or BigQuery. By pairing S3's low-cost storage with open-source query engines, teams can get strong query performance at low cost and with modest engineering overhead.
The main challenge is choosing the right combination of query engine and table format to get Snowflake-like performance without the Snowflake price tag. This overview compares the fastest open-source engines for querying data on S3, discusses the merits of different storage formats (Parquet, Iceberg, Delta Lake), explores practical architecture patterns, and explains how these solutions integrate with dbt for SQL-based transformations.
Modern open-source query engines can scan S3-based datasets remarkably fast—sometimes even outperforming commercial data warehouses. Here are some leading engines and their strengths:
DataFusion is a Rust-native SQL engine built on Apache Arrow that has recently posted leading results for single-node Parquet query performance. In benchmarks from late 2024, it was the fastest engine for querying Parquet on a single node, outperforming DuckDB and even ClickHouse on identical hardware.
This is notable because DataFusion is fully open-source and community-driven, with no large company backing it. Its vectorized execution and Rust optimizations have pushed it to the top of the charts. DataFusion is designed as an embedded library, making it easy to integrate into custom applications or serverless functions. For distributed use, the Ballista project extends DataFusion to a cluster environment.
Overall, DataFusion offers cutting-edge speed in a small footprint, though it's a younger project and typically requires some coding to use (there is no out-of-the-box SQL service).
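To give a sense of how lightweight it is, here is a minimal sketch of querying S3-resident Parquet through datafusion-cli. This assumes AWS credentials are available in the environment; the bucket, path, and column names are placeholders, not anything from a real dataset.

```sql
-- Sketch: register an S3 Parquet dataset as an external table in datafusion-cli,
-- then query it with plain SQL. Bucket, path, and column names are placeholders;
-- AWS credentials are assumed to be configured in the environment.
CREATE EXTERNAL TABLE events
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

SELECT event_type, count(*) AS n
FROM events
GROUP BY event_type
ORDER BY n DESC
LIMIT 10;
```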
DuckDB is an in-process OLAP database known for its ease of use and impressive performance on analytical SQL. It comes as a simple single-file library or binary and can process billions of rows using vectorized execution. DuckDB can query Parquet files on S3 directly (using built-in S3 filesystem support) with no ETL step, essentially treating S3 like a remote disk.
For moderate data sizes (tens to low hundreds of GB), DuckDB often shines. For example, one benchmark on about 100GB of TPC-DS data found DuckDB to be the fastest engine for about 75% of the test queries (Trino was faster in the other 25%). DuckDB is ideal for interactive analytics and transformation pipelines where a small team can run queries from their laptop or an ephemeral server.
It is low-overhead, requiring no separate server to manage, and excels at complex SQL on a single node. However, DuckDB is not distributed; it scales up (using multiple cores on one machine) but not out. For very large datasets or high concurrency, you would need to shard the work or consider a different engine.
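For illustration, here is a hedged sketch of what "treating S3 like a remote disk" looks like from a DuckDB shell. The httpfs extension and S3 credentials are assumed to be set up, and the paths and columns are placeholders.

```sql
-- Sketch: query Parquet on S3 directly from DuckDB, no ETL step.
-- Assumes the httpfs extension and configured S3 credentials; paths are placeholders.
INSTALL httpfs;
LOAD httpfs;

SELECT customer_id, sum(amount) AS total_spend
FROM read_parquet('s3://my-bucket/orders/*.parquet')
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 10;
```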
ClickHouse is a battle-tested open-source OLAP database known for its blazing speed at scale. Traditionally, ClickHouse stores data in its own columnar format (MergeTree engines) on local disks, but it also has first-class support for querying external data on S3 without ingestion, including Parquet files and table formats like Iceberg and Delta.
Recent engineering work has optimized ClickHouse to run directly on Parquet files, and the ClickHouse team has shown that querying Parquet can be nearly as fast as using its optimized native storage engine. A ClickHouse cluster can therefore scan huge S3-resident datasets at interactive speeds, and if needed you can always load hot data into local storage for an extra boost.
ClickHouse excels at large-scale, high-concurrency workloads. It is massively parallel and can be distributed across many nodes for petabyte-scale analytics. It does come with more operational overhead: running a ClickHouse cluster (or even a single server) requires provisioning and management. However, it is a worthy open-source contender to Snowflake for those willing to manage infrastructure, especially given its real-time capabilities and mature feature set.
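To make the query-in-place idea concrete, here is a sketch using ClickHouse's s3 table function. The URL and column names are placeholders; credentials can be passed as function arguments or supplied via named collections.

```sql
-- Sketch: scan Parquet on S3 in place with the s3 table function.
-- URL and column names are placeholders; credentials can be passed as arguments
-- or supplied via named collections.
SELECT
    toDate(event_time) AS day,
    count()            AS events
FROM s3('https://my-bucket.s3.amazonaws.com/events/*.parquet', 'Parquet')
GROUP BY day
ORDER BY day;
```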
Trino is the community-driven fork of Presto (formerly known as PrestoSQL), designed for fast interactive SQL on large data lakes. It can query data from S3 via a Hive/Glue catalog or directly via connectors, all without requiring data to be copied—making it a natural fit for a federated query layer. Trino's advantage is a feature-rich SQL interface and proven scalability for ad-hoc analytics.
It is used in platforms like AWS Athena (which uses Presto under the hood) and can handle a wide range of use cases from ETL to BI dashboards. Compared to big-data frameworks like Spark, Trino is much more interactive and low-latency. Spark might handle everything, but it is not optimized for quick queries, whereas Trino "just works" for SQL with far less tuning and complexity than old-school Hive.
The trade-off is that Trino requires a persistent cluster of nodes (a coordinator and workers), so there is some infrastructure to maintain. By default it lacks fault tolerance on long queries (if a worker fails, the whole query must be rerun), but for batch SQL this is usually acceptable. Many see Trino as a top open-source Snowflake alternative because it provides a cloud data warehouse SQL experience without vendor lock-in.
In practice, Trino is ideal if you need concurrent, distributed querying over large datasets and can tolerate some operational setup. It is also extensible via connectors, allowing you to query other databases or object stores within one SQL interface.
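As a sketch of what that looks like in practice: querying an S3-backed table registered in a Hive/Glue catalog and materializing a summary back into the lake. All catalog, schema, and table names below are placeholders.

```sql
-- Sketch: ad-hoc query over an S3-backed table registered in a Hive/Glue catalog,
-- then materializing a summary back into the lake. All names are placeholders.
SELECT region, sum(revenue) AS revenue
FROM hive.sales.orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY region;

CREATE TABLE hive.sales.revenue_by_region AS
SELECT region, sum(revenue) AS revenue
FROM hive.sales.orders
GROUP BY region;
```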
There are many other engines in the ecosystem. PrestoDB (the original Facebook-led Presto project, from which Trino split off) is similar to Trino but less community-driven; most recommend Trino for new projects. Apache Spark SQL can query S3 Parquet/Iceberg and is widely used for ETL, but its overhead and latency are higher for interactive use. That said, Spark is a workhorse for large-scale transformation pipelines and integrates deeply with the Hadoop ecosystem.
Apache Flink is another distributed engine, often used for streaming, that supports batch SQL on Iceberg—great for hybrid real-time requirements. Dremio is an open-source data lake query platform (built on Apache Arrow) that offers another route to high-speed S3 analytics with a user-friendly interface. Emerging projects like MotherDuck (managed DuckDB) and chDB (an embedded ClickHouse library) show that even traditionally server-based engines are becoming available in lightweight modes.
The open-source landscape offers a continuum: from in-process engines (DuckDB, DataFusion) that require zero infrastructure but scale to around 100GB, up to clustered powerhouses (ClickHouse, Trino) that handle terabytes with ease. You can choose the engine that fits your performance needs and your team's operational capacity.
Storage format plays a critical role in an S3 data lake's performance and management. All of the engines above can query Parquet files—Parquet is the de facto standard columnar format due to its compression and efficient encoding. The simplest architecture is to keep data as Parquet in S3 and use a Hive-style metastore or Glue Data Catalog for table definitions, or even just point engines at S3 paths.
This works well for append-only datasets or read-heavy analytics. Parquet's efficiency is so high that engines like ClickHouse can scan it nearly as fast as their own formats. However, plain Parquet on S3 has limitations: no built-in support for transactions, concurrent writes, or automatic partition management. This is where table formats come in.
Iceberg is an open table format designed to bring ACID transactions and schema evolution to data lakes. Iceberg tracks table metadata (file lists, schema, partition info) in a manifest tree, allowing reliable upserts, deletes, and time travel queries on S3 data. It is engine-agnostic and has broad compatibility (Spark, Flink, Trino, Presto, Hive, Snowflake, and others can all read Iceberg tables).
Iceberg is often favored in open stacks because it is completely vendor-neutral and was designed through the Apache process. It supports hidden partitioning, version rollback, and other features to keep large tables performant and maintainable. If your architecture is diverse or you want maximum future flexibility, Iceberg is a strong choice. Many consider Iceberg the default for non-Spark environments—if you are using an open data stack, Apache Iceberg is probably best.
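A brief sketch of what those features look like through Trino's Iceberg connector; the names are placeholders and exact syntax varies by engine and version.

```sql
-- Sketch: Iceberg schema evolution and time travel from Trino.
-- Catalog, schema, and table names are placeholders; syntax varies by engine.
ALTER TABLE iceberg.analytics.events ADD COLUMN device_type varchar;

-- Query the table as of an earlier point in time (snapshot-based time travel).
SELECT count(*)
FROM iceberg.analytics.events
  FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC';
```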
Delta Lake, originally from Databricks, also provides ACID transactions and is now open-source (since 2019) under the Linux Foundation. Delta's metadata is stored in the "Delta Log" (a transaction log of JSON and Parquet checkpoint files) rather than Iceberg's manifest files. Feature-wise, Iceberg and Delta are converging—both have ACID, time travel, and similar capabilities.
In practice, Delta Lake is a great option if you are coming from a Spark-centric world. It was built to integrate tightly with Apache Spark and is highly optimized in that context (the Databricks runtime uses Delta by default for strong performance). Some benchmarks indicate Delta can be extremely fast for certain workloads, as Databricks has invested heavily in performance. If your team primarily uses Spark or Databricks, Delta is likely the more natural fit.
Outside of Spark, Delta is gaining support (for example, Presto/Trino and Flink can read Delta), but Iceberg still has a bit more multi-engine adoption.
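For a flavor of the Spark-side workflow, here is a hedged sketch in Spark SQL with Delta Lake. Paths and names are placeholders, and on open-source Spark the S3 scheme is typically s3a:// rather than s3://.

```sql
-- Sketch: create a Delta table from raw Parquet, then update it in place,
-- an ACID operation plain Parquet cannot offer. Paths and names are placeholders.
CREATE TABLE sales_delta
USING DELTA
LOCATION 's3://my-bucket/delta/sales/'
AS SELECT * FROM parquet.`s3://my-bucket/raw/sales/`;

UPDATE sales_delta
SET status = 'refunded'
WHERE order_id = 1001;
```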
Do not underestimate sticking with vanilla Parquet plus Hive-style partitions (or even CSV for small data) if your use case is read-only or append-only. The simplest architecture—Parquet on S3—has no additional metadata layers to maintain and is fully compatible with all engines. You lose ACID capabilities, so it is not ideal if you need to update or delete records frequently.
But if you can do append-only fact tables and re-create derived tables in batch, plain Parquet is the simplest and often the most cost-efficient route. Many teams start here: raw data stored as partitioned Parquet, queried via engines like Trino or ClickHouse, which apply their own optimizations (predicate pushdown, file pruning via min/max stats, and so on). Should needs grow, you can always upgrade to Iceberg or Delta later (both have tools to convert existing Parquet datasets into tracked tables).
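For example, with a Hive-style layout such as s3://my-bucket/events/dt=2024-06-01/... (a placeholder path), an engine can skip entire partitions based on the filter, as in this DuckDB sketch:

```sql
-- Sketch: partition pruning over Hive-style partitioned Parquet on S3.
-- Only files under matching dt=... prefixes are read; paths are placeholders.
SELECT count(*)
FROM read_parquet('s3://my-bucket/events/*/*.parquet', hive_partitioning = true)
WHERE dt = '2024-06-01';
```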
If you need reliable merges or deletes on S3 data, go with a table format (Iceberg if you prefer the open ecosystem, Delta if you are heavily Spark-focused). If your pipeline is mostly batch inserts and read-only queries, Parquet alone is perfectly fine and avoids extra complexity.
Iceberg and Delta do introduce a lightweight metadata management overhead—for example, you will run compaction jobs or maintain a catalog—but they pay off by preventing "data swamp" problems as your lake grows. Both Iceberg and Delta are open source, so you are not locked in, and the choice can come down to toolset: for example, Trino and Flink have first-class Iceberg support, whereas Databricks and Spark have the most mature Delta integration.
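That maintenance typically amounts to a couple of scheduled statements. A sketch of routine Iceberg housekeeping run from Trino follows; the names are placeholders and the available procedures and options vary by connector version.

```sql
-- Sketch: routine Iceberg table maintenance run from Trino on a schedule.
-- Names are placeholders; procedures and options vary by connector version.
ALTER TABLE iceberg.analytics.events EXECUTE optimize;  -- compact small files
ALTER TABLE iceberg.analytics.events EXECUTE expire_snapshots(retention_threshold => '7d');
```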
Many large organizations are standardizing on Iceberg for a vendor-agnostic data lakehouse, with Delta (and even a third format, Apache Hudi) as alternatives in specific ecosystems.
Designing an S3-based data platform involves balancing performance, cost, and complexity. Here are some practical architecture patterns that combine the above engines and formats.
One of the big advantages of decoupling storage (S3) from compute is the ability to scale compute independently and even turn it off when not in use. A common pattern for low overhead is to use a serverless or on-demand query engine against S3. For example, AWS Athena is a fully-managed service that queries S3 with Presto/Trino.
In the open-source realm, you can achieve similar setups: deploy Trino on Kubernetes or AWS EMR so that clusters auto-scale or shut down when idle, or even invoke lightweight engines for one-off queries. An emerging idea is to use in-process engines (DuckDB or DataFusion) inside ephemeral cloud functions or jobs—for instance, an AWS Glue job that spins up, uses DuckDB to query or transform Parquet in S3, then shuts down.
This approach yields zero persistent infrastructure: you pay only for the query runtime and S3 storage/IO. It is extremely cost-efficient for spiky workloads or smaller teams. DataFusion in particular, being a library, could be embedded in a Lambda or as a Rust microservice to provide a "bring-your-own-Athena" experience with no servers to manage.
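The actual work inside such a job can be a single SQL statement. Here is a sketch of what an ephemeral DuckDB run might execute before shutting down; the paths are placeholders, and the httpfs extension and S3 credentials are assumed.

```sql
-- Sketch: a one-shot transformation an ephemeral DuckDB job could run:
-- read raw Parquet from S3, aggregate, and write the result back as Parquet.
-- Paths are placeholders; the httpfs extension and credentials are assumed.
COPY (
    SELECT
        customer_id,
        date_trunc('day', order_ts) AS day,
        sum(amount)                 AS revenue
    FROM read_parquet('s3://my-bucket/raw/orders/*.parquet')
    GROUP BY 1, 2
) TO 's3://my-bucket/marts/daily_revenue.parquet' (FORMAT PARQUET);
```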
The downside of serverless is the cold-start latency and potential limits on concurrency or memory (for example, Lambdas have memory limits, and Trino clusters take time to spin up). Still, this pattern minimizes engineering effort—you bypass managing database servers altogether. The focus shifts to optimizing S3 layout (such as partitioning data to minimize scan costs) and query tuning, rather than infrastructure.
When low-latency and high concurrency are required continuously (for example, powering BI dashboards or complex reports), a dedicated cluster of compute nodes may be justified. ClickHouse clusters are a prime example: you might run a cluster of ClickHouse servers that either ingest data from S3 or query it in place.
One strategy is to load recent hot data into ClickHouse's local storage (for sub-second queries on active data) while keeping historical data in Parquet on S3 and using external table functions to query it on demand. ClickHouse's ability to treat S3 as just another storage tier is very useful—you can insert data from S3, or query external files via S3 table functions.
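A sketch of that hot/cold split in ClickHouse SQL; the table definition, bucket URL, and columns are placeholders.

```sql
-- Sketch: keep recent "hot" data in a local MergeTree table for sub-second queries,
-- hydrating it from Parquet on S3. Names, columns, and the URL are placeholders.
CREATE TABLE events_hot
(
    event_time DateTime,
    user_id    UInt64,
    event_type String
)
ENGINE = MergeTree
ORDER BY (event_type, event_time);

INSERT INTO events_hot
SELECT event_time, user_id, event_type
FROM s3('https://my-bucket.s3.amazonaws.com/events/*.parquet', 'Parquet')
WHERE event_time >= now() - INTERVAL 30 DAY;
```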
Similarly, a Trino cluster could be always-on, perhaps co-located near the S3 buckets, to serve ad-hoc queries quickly. Running your own cluster does incur engineering overhead (you need to handle scaling, upgrades, monitoring, and so on), but for heavier workloads the trade-off is worth it. These open-source engines are still typically cheaper than proprietary warehouses because you are running on raw EC2 or Kubernetes infrastructure at cloud cost, not paying a premium. Also, you have full control to optimize (for example, choosing instances with large NVMe caches for ClickHouse, or using spot instances for Trino workers to save money).
Spark clusters can also be used in a similar dedicated way for ETL jobs that run on a schedule—for instance, an EMR cluster each night to do large transformations and output the results to S3 in Parquet or Iceberg. The key is that storage remains on S3, so you avoid duplicating large datasets across systems—the cluster is purely for compute.
This pattern provides maximum performance (especially if you tune the engines) and is suitable when query load is constant or when very large datasets (multi-terabyte) must be processed by parallel compute nodes.
Another pattern gaining popularity is pushing the query engine to the analyst or to the edge. For example, an analyst with a decent laptop can run DuckDB to query S3 data directly, or use a notebook that connects to S3 and processes data locally. This is surprisingly effective for moderate data sizes and completely eliminates server costs for those queries.
There are cases of analysts scanning tens of millions of rows from S3 with DuckDB on their laptop—the performance is often within seconds, thanks to columnar processing. Likewise, inside a production pipeline you might embed DuckDB or DataFusion within an application (Python, Rust, and so on) to perform transformations close to the data consumer.
This in-process approach has minimal overhead: no network hops, no separate database, just reading Parquet from S3 straight into memory and processing it. It is essentially bringing the warehouse to the data consumer. The obvious limitation is that it does not scale beyond one machine's resources, but modern workstations or cloud VMs can handle a surprisingly large workload (100GB or more in some cases) when using these highly optimized engines.
Embedded engines are also great for testing and development, such as using DuckDB to prototype a data model locally before deploying it to a larger engine.
The choice often comes down to scale and use case. Many organizations adopt a hybrid approach: for example, use DuckDB for local data exploration or small data marts, a serverless Trino or Athena for on-demand querying, and perhaps a ClickHouse cluster for critical low-latency analytics on key datasets.
The decoupling of S3 storage means you can mix and match these engines on the same data. This is a huge win for cost optimization—you might only incur big compute costs for the workloads that truly need it, and use cheaper or slower methods for the rest. Open formats like Parquet and Iceberg ensure the data is readable by all tools, avoiding lock-in.
A major question for teams adopting open-source data lake architectures is how to handle transformations and modeling, which is traditionally the role of ETL or ELT in data warehouses. dbt (Data Build Tool) has become the standard for managing SQL-based transformation pipelines (the "T" in ELT), and it works well with open-source engines. Nearly all the engines discussed have adapters or integrations with dbt.
DuckDB's simplicity makes it an excellent companion to dbt. An actively maintained dbt-duckdb adapter lets dbt run models using DuckDB as the SQL execution engine. This setup is great for small teams or moderate data sizes—you can run your entire dbt project locally or in a single VM, with DuckDB handling all the SQL.
Because DuckDB can read and write Parquet, your dbt models can output new Parquet files on S3 or query them in place. The appeal is zero infrastructure—just run dbt build and it uses DuckDB under the hood. Many data teams use this for development and testing, and some even for production with datasets in the tens of gigabytes.
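A sketch of such a model (say, models/orders_summary.sql in a project targeting dbt-duckdb): the S3 path is a placeholder, and S3 access would be configured in the dbt profile.

```sql
-- Sketch: a dbt model executed by DuckDB, reading Parquet straight from S3.
-- The path is a placeholder; S3 credentials come from the dbt-duckdb profile.
{{ config(materialized='table') }}

SELECT
    status,
    count(*)    AS orders,
    sum(amount) AS revenue
FROM read_parquet('s3://my-bucket/raw/orders/*.parquet')
GROUP BY status
```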
The limitation is that DuckDB is single-node; if your models need to transform terabyte-scale data, you will need a powerful machine or a different engine.
There is a well-established dbt-trino adapter (and one for Presto) that allows dbt to run models on a Trino cluster. This is a popular combination for an open lakehouse stack: use Trino as the query engine for both interactive queries and dbt-managed ELT. Since Trino can create tables in S3 (via a Hive metastore or Iceberg), your dbt models can materialize as new tables or views in your S3 data lake.
This setup does require that you have a Trino (or Presto) service running, but many companies already maintain one for BI queries, so leveraging it for dbt is natural. The benefit is scalability—a Trino cluster can handle very large transformations in SQL, similar to how BigQuery or Snowflake would.
The dbt integration abstracts away the engine details: models are just SQL, and Trino executes them. If you are migrating from a warehouse to S3 and Trino, dbt makes the move easier, since you can port your SQL models over and swap the target to Trino. Note that not 100% of SQL syntax is identical (for example, T-SQL versus Trino SQL differences), but dbt and Trino both adhere to standard SQL fairly well.
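A sketch of a model targeting Trino; the source definition and target catalog/schema live in the dbt project configuration, and the names are placeholders.

```sql
-- Sketch: a dbt model materialized as a table in the S3 data lake via Trino.
-- The source and target catalog/schema are configured elsewhere in the project.
{{ config(materialized='table') }}

SELECT
    region,
    sum(revenue) AS revenue
FROM {{ source('sales', 'orders') }}
GROUP BY region
```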
ClickHouse can also be used with dbt via a community-supported adapter. In this combination, dbt will connect to a ClickHouse server and issue CREATE TABLE AS SELECT or CREATE VIEW statements based on your models. It is a slightly less common pairing, but it is gaining traction for those who want the extreme speed of ClickHouse in their transformation layer.
One thing to watch is that ClickHouse's SQL dialect has some differences, and it traditionally was not designed for ad-hoc transformation as much as for analytics on already-loaded data. Nevertheless, basic transformations and materializations (views, tables, incremental load) can be done, and the ClickHouse team has documented examples of using dbt with ClickHouse.
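Roughly speaking, a table-materialized model ends up as a statement like the following on the ClickHouse side. This is only a sketch: the engine choice, ordering key, and names are placeholders, and the adapter's exact DDL may differ.

```sql
-- Sketch: approximately what a table materialization compiles to in ClickHouse.
-- Engine, ordering key, and names are placeholders; the adapter's real DDL may differ.
CREATE TABLE analytics.daily_events
ENGINE = MergeTree
ORDER BY day
AS
SELECT
    toDate(event_time) AS day,
    count()            AS events
FROM analytics.raw_events
GROUP BY day;
```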
The advantage is that you could use ClickHouse both as your transformation engine and your serving layer for BI, unifying everything in one system. Just remember that this means keeping a ClickHouse service up and running for dbt to use.
For those already invested in Spark (perhaps using PySpark or Spark SQL on EMR or Databricks), dbt has an adapter for Spark as well. This is often used in more complex pipelines where some transformations are easier to express in Spark (Python or Scala) but others are just SQL—dbt allows wrapping Spark SQL transformations in a managed DAG.
However, Spark with dbt will have higher latency per query (as Spark jobs spin up), so it might not be as snappy as DuckDB or Trino for interactive model development. Still, it is an open-source option and can leverage Spark's ability to handle large data and various data sources. Spark is also the execution engine you would use if you want to unify streaming and batch, though dbt itself is focused on batch transformations.
If you are using Flink for streaming, you would likely manage transformations in code (Flink SQL is not commonly used via dbt). DataFusion currently does not have a direct dbt adapter (since it is not an always-on SQL service), but conceivably one could write a custom integration to run DataFusion queries within a dbt run (this would be an advanced use case). Some newer cloud offerings (like Materialize for streaming SQL) are not yet common in the dbt world.
Broadly speaking, dbt has adapters for all the major open SQL engines, which means you can keep using the same development workflow and version-controlled SQL files, just changing the target warehouse. This lowers the barrier for teams to move from, for example, Snowflake to an S3-based platform—you do not have to rewrite your entire transformation logic, just adjust SQL for any dialect differences and switch dbt connectors.
In 2025, the open-source ecosystem for data analytics on Amazon S3 is thriving. You can combine S3's cheap, durable storage with engines like DataFusion, DuckDB, ClickHouse, or Trino to achieve performance that rivals commercial data warehouses—often at a fraction of the cost. The latest benchmarks show that open engines are not only catching up, but in some cases surpassing the proprietary systems in speed.
With smart use of open table formats (Iceberg or Delta) and new architecture approaches (serverless compute, embedded engines), the engineering overhead can be kept minimal. In-process engines require no infrastructure, while distributed engines have become easier to deploy and manage with containerization and cloud services.
Crucially, the flexibility of an open data lake means you are not locked into one vendor's stack—you can iterate and adopt new tools as they emerge, all while keeping your data in open formats on S3. For technically savvy teams and even for executives watching the bottom line, an open-source S3 data lakehouse offers a compelling high-performance, low-cost analytics solution.
It may not be "set it and forget it" like Snowflake—you do take on some responsibility in assembling the pieces—but the gap is closing fast. Usability is improving (SQL interfaces, dbt integration, and so on), and the community is rapidly innovating. By choosing the right engines and formats for your needs, you can deliver fast queries and robust data pipelines on S3 without the hefty price tag.
In short, you can have your analytics cake and eat it too.