This week we participated in a one-day AWS Immersion Day - Data Foundation with Apache Iceberg on AWS. Here’s a summary of our learnings from the event and some thoughts on how Apache Iceberg could be used in our data-oriented business cases.
What is Apache Iceberg?
Apache Iceberg is an open table format designed for huge datasets stored in cloud-based object storage systems like Amazon S3. Its main benefit is SQL query support for data stored as column-oriented files. Unlike some other solutions in this space, Iceberg supports, for example, schema evolution without side effects, as well as ACID transactions.
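To make the schema evolution point concrete, here is a minimal, hedged sketch using PyIceberg (the Python client, which we come back to below). It assumes a Glue-backed catalog and a table name that are purely hypothetical placeholders:

```python
# Hedged sketch: schema evolution with PyIceberg on an existing table.
# Catalog configuration and the table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import FloatType

catalog = load_catalog("default", **{"type": "glue"})
table = catalog.load_table("iceberg_demo.sensor_readings")

# The change is recorded in Iceberg metadata only; existing data files stay
# valid, and readers see the new column as null for old rows.
with table.update_schema() as update:
    update.add_column("humidity", FloatType(), doc="relative humidity, added later")
```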
From an infrastructure perspective, a combination like Apache Iceberg + Amazon S3 + Amazon Athena allows for an entirely serverless big data SQL execution environment, which is a huge bonus if you don’t want to spend your time maintaining distributed database clusters yourself.
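As a rough illustration of how little infrastructure this needs, here is a sketch of submitting a query from Python with boto3 and Athena. The database, table, and result bucket names are hypothetical placeholders:

```python
# Minimal sketch: run a SQL query against an Iceberg table via Amazon Athena.
# Names of the database, table, and S3 bucket are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Start the query; Athena reads the Iceberg metadata and Parquet files straight from S3.
execution = athena.start_query_execution(
    QueryString="SELECT sensor_id, avg(temperature) FROM sensor_readings GROUP BY sensor_id",
    QueryExecutionContext={"Database": "iceberg_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes; there are no clusters to manage on our side.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```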
AWS Experience
The AWS Experience workshop was a nice blend of real customer problems and hands-on examples of how Iceberg can be used together with AWS services. We tried out integrations with Amazon Athena and Amazon EMR, gaining valuable experience of the kinds of workflows Iceberg supports.
Since we love Python, as a bonus we also experimented with querying the workshop’s example catalogue through PyIceberg and Polars. It seems that Iceberg could be integrated quite easily into our existing Python workflows.
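A minimal sketch of that kind of workflow is below: load a table from an Iceberg catalog with PyIceberg and hand the result to Polars. The catalog configuration and table name are hypothetical, not the ones from the workshop:

```python
# Hedged sketch: reading an Iceberg table into Polars via PyIceberg.
# The catalog type, table name, and columns are hypothetical placeholders.
import polars as pl
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", **{"type": "glue"})
table = catalog.load_table("iceberg_demo.sensor_readings")

# Push a row filter and column projection down to Iceberg, then continue in Polars.
arrow_table = table.scan(
    row_filter="temperature > 30.0",
    selected_fields=("sensor_id", "temperature", "measured_at"),
).to_arrow()

df = pl.from_arrow(arrow_table)
print(df.group_by("sensor_id").agg(pl.col("temperature").mean()))
```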
Potential Iceberg use cases for us
Currently most of our customer data requirements are handled with PostgreSQL, which is usually a solid choice. Still, there are some more specialized use cases where AWS + Iceberg could be a relevant combination. Some of our customers have data storage requirements in the petabyte range, which is exactly the scale a tool like Iceberg is built for. Especially when query reliability and schema evolution matter more than retrieval speed, Iceberg is a very viable alternative.
For example, in one of our projects we store and process hundreds of gigabytes of sensor data as CSV files, and this volume may well grow tenfold in the future. At that point, it could be a reasonable alternative to convert the raw data files to Parquet format and store them in S3, so that querying scales without relying on server-based database clusters. A sketch of that conversion is shown below.
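The conversion itself is straightforward; here is a minimal sketch using PyArrow, assuming local CSV files and a hypothetical S3 bucket name:

```python
# Hedged sketch: convert a raw CSV file to Parquet and write it directly to S3.
# The file name, bucket, and key are hypothetical placeholders.
from pyarrow import csv, fs
import pyarrow.parquet as pq

s3 = fs.S3FileSystem(region="eu-west-1")

# Column-oriented Parquet compresses well and lets engines like Athena
# read only the columns a query actually needs.
sensor_data = csv.read_csv("sensor_data_2024_06.csv")
pq.write_table(
    sensor_data,
    "my-sensor-data-lake/raw/sensor_data_2024_06.parquet",
    filesystem=s3,
    compression="zstd",
)
```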
Another great real-life scenario that Iceberg supports is GDPR’s “right to be forgotten”. Iceberg’s architecture has a dedicated metadata layer that tracks things like schema changes and data deletions. Thanks to that layer, it’s easy to delete individual customer rows even from very large datasets. Earlier solutions, like Apache Hive, require partition-level rewrites, which quickly become expensive in both time and compute. This is particularly important for EU-based customer requirements.
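As a sketch of what such a deletion could look like in practice, assuming the table is an Iceberg table registered in Glue and queried through Athena (engine version 3 supports row-level DELETE on Iceberg tables), with hypothetical names throughout:

```python
# Hedged sketch: "right to be forgotten" deletion on an Iceberg table via Athena.
# Table, database, customer id, and bucket names are hypothetical placeholders.
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Iceberg records the deletion in metadata/delete files instead of rewriting
# whole partitions, so removing a single customer stays cheap even on huge tables.
athena.start_query_execution(
    QueryString="DELETE FROM customer_events WHERE customer_id = '1234-abcd'",
    QueryExecutionContext={"Database": "iceberg_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```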
Is Iceberg a good fit for you?
Do you need to make business decisions based on very large volumes of data? Is there a risk that you will need to make complex schema changes to it in the future? Is your data already stored as, for example, Parquet files? Then you might want to consider Apache Iceberg, especially if you are already using AWS.
Feel free to contact us at info@interjektio.fi and we can explore solutions to your data storage and processing problems, whether it’s with Apache Iceberg or other systems, like our forever-favorite PostgreSQL.