big data – Secure Machinery

What does Apache Iceberg do? It addresses challenges in the Hadoop/Hive ecosystem around maintaining very large tables in data warehouses. These challenges include data updates such as schema changes and deletes, multiple concurrent writers and readers, slow performance on large tables, and poor integration with cloud-based object storage.

Iceberg –

It manages large, slow-changing tabular data and gives an SQL interface to the data so that it can be queried efficiently.
It breaks a table's data into partitioned files and stores those files in an object store such as S3. Queries can skip files by filtering on the partition key(s). Iceberg calls this “hidden partitioning”: partition values are derived by the system from column values, without exposing the details to the client.
It separates metadata management from the data. Table metadata is stored in its own files, not in the data files.
It separates the table schema from the data. A change of column name does not affect the data files (Schema Evolution).
It allows accessing data as it existed at a specific point in time (Time Travel). This feature is useful for auditing, debugging and reproducing issues that occurred in the past. Time travel is implemented with snapshots: each commit produces a new snapshot, so multiple versions of the same table exist at the same time. (Copy-on-write is one of the write modes used in the implementation.)
It provides ACID-compliant transactions for data modifications and snapshot isolation for queries, which help ensure consistency and correctness of data.
It does all this through a lightweight design with minimal coordination.
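The partitioning point can be made concrete with a small sketch. This is a toy model, not Iceberg's API: `day_transform`, `DataFile` and `plan_scan` are hypothetical names, and the file list stands in for what the table metadata would record. It shows the idea that the client filters on the timestamp column while the system derives the partition value and prunes files.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone


def day_transform(ts: datetime) -> date:
    """Derive a partition value from a timestamp column (a 'day' transform)."""
    return ts.astimezone(timezone.utc).date()


@dataclass(frozen=True)
class DataFile:
    path: str
    partition_day: date  # recorded in table metadata, never supplied by the client


def plan_scan(files, start: datetime, end: datetime):
    """Keep only files whose day partition can contain rows in [start, end]."""
    lo, hi = day_transform(start), day_transform(end)
    return [f for f in files if lo <= f.partition_day <= hi]


# Hypothetical file list; in a real table this comes from the manifests.
files = [
    DataFile("data/a.parquet", date(2024, 1, 1)),
    DataFile("data/b.parquet", date(2024, 1, 2)),
    DataFile("data/c.parquet", date(2024, 1, 3)),
]

# The query filters on the timestamp only; no partition column is mentioned.
selected = plan_scan(
    files,
    datetime(2024, 1, 2, 6, tzinfo=timezone.utc),
    datetime(2024, 1, 2, 18, tzinfo=timezone.utc),
)
print([f.path for f in selected])
```

Only the file for 2024-01-02 survives planning, so the other two files are never opened.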

An HDFS/HMS to S3/Iceberg migration is typically done with a combination of Kafka for replication and Spark for the data format transformation.

The fundamental construct in Iceberg is called a table. But “table” is a confusing word here, as this is not like a database table that is defined just by a schema with column names and types, foreign keys and so on. An Iceberg table is a different kind of entity, one which tracks schemas, snapshots and manifests. It is better called a “Versioned File Set With Schema and Snapshot Metadata”. This table entity is versioned: it has a current state and a history of its past states. This state is stored in table metadata files, and each metadata change creates a new table metadata file. The table metadata file has a list of schemas and a current schema id. Each schema has a set of fields (name, type). These fields can be added, deleted or renamed, can be tagged as required or optional, and can be nested. The identity of a field exists beyond its name and type in a field_id, which is essentially immutable (if a field is dropped and re-added, it is treated as a new field with a new id). In addition to the schemas, an Iceberg table tracks snapshots and manifests, which record which data currently makes up the table. A snapshot answers “what is the table at this moment”: it is the state of the entire table. A manifest is a metadata file about a subset of the files in a snapshot.
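The role of the field id is easier to see in code. The following is a minimal sketch under stated assumptions: `Field`, `Schema` and `read_row` are hypothetical names, and a stored row is modeled as a dict keyed by field id, standing in for a data file whose columns are tagged with ids. It is not Iceberg's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Field:
    field_id: int
    name: str
    type: str
    required: bool = False


@dataclass
class Schema:
    fields: Dict[int, Field] = field(default_factory=dict)
    last_field_id: int = 0  # ids are never reused, even after a drop

    def add(self, name: str, type_: str, required: bool = False) -> int:
        self.last_field_id += 1
        self.fields[self.last_field_id] = Field(self.last_field_id, name, type_, required)
        return self.last_field_id

    def _by_name(self, name: str) -> Field:
        for f in self.fields.values():
            if f.name == name:
                return f
        raise KeyError(name)

    def rename(self, old: str, new: str) -> None:
        self._by_name(old).name = new  # metadata-only change, id unchanged

    def drop(self, name: str) -> None:
        del self.fields[self._by_name(name).field_id]


def read_row(schema: Schema, stored_row: Dict[int, object]) -> Dict[str, object]:
    """Project a stored row (keyed by field id) through the current schema."""
    return {f.name: stored_row.get(f.field_id) for f in schema.fields.values()}


schema = Schema()
user_id = schema.add("user_id", "long", required=True)
country = schema.add("country", "string")
stored_row = {user_id: 7, country: "DE"}  # written once, never rewritten below

schema.rename("country", "region")
after_rename = read_row(schema, stored_row)

schema.drop("region")
readded = schema.add("region", "string")  # same name, new identity
after_readd = read_row(schema, stored_row)

print(after_rename, readded, after_readd)
```

After the rename the old data is still readable under the new name. After the drop and re-add the new `region` field has a fresh id, so the old values do not resurface under it.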

Figure. The Iceberg table format is used by multiple engines and is capable of writing to multiple storage types. Source.

Ryan Blue’s discussion on the rationale for the design is here and a presentation with performance improvements is at https://conferences.oreilly.com/strata/strata-ny-2018/cdn.oreillystatic.com/en/assets/1/event/278/Introducing%20Iceberg_%20Tables%20designed%20for%20object%20stores%20Presentation.pdf

“By building support for Iceberg, data warehouses can skip the query layer and share data directly. Iceberg was built on the assumption that there is no single query layer. Instead, many different processes all use the same underlying data and coordinate through the table format along with a very lightweight catalog. Iceberg enables direct data access needed by all of these use cases and, uniquely, does it without compromising the SQL behavior of data warehouses.”
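One way to picture “coordinate through the table format along with a very lightweight catalog” is an atomic swap of a pointer to the current table metadata file, retried on conflict. The sketch below assumes exactly that and nothing more: `Catalog`, `swap`, `commit` and `next_metadata` are hypothetical names, the catalog is an in-memory dict, and the metadata file names are made up for illustration.

```python
import threading


class Catalog:
    """Hypothetical in-memory catalog: table name -> current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Replace the pointer only if no other writer committed first."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit(catalog, table, make_metadata, max_retries=3):
    """Optimistic commit: build new metadata from the base, then try to swap."""
    for _ in range(max_retries):
        base = catalog.current(table)
        new = make_metadata(base)  # stands in for writing a new metadata file
        if catalog.swap(table, base, new):
            return new
    raise RuntimeError("commit failed after retries")


def next_metadata(base):
    # Made-up naming scheme: v1.metadata.json, v2.metadata.json, ...
    version = 0 if base is None else int(base[1:base.index(".")])
    return f"v{version + 1}.metadata.json"


catalog = Catalog()
first = commit(catalog, "events", next_metadata)

# A writer that still believes the table is empty loses the race.
stale_ok = catalog.swap("events", None, "v1-other.metadata.json")

second = commit(catalog, "events", next_metadata)
print(first, stale_ok, second)
```

The only shared mutable state is the pointer, which is why the catalog can stay small: readers and writers otherwise work against immutable files.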

The client is a Java JAR file that can be embedded in an application.

How does Iceberg store files in S3?

The table's location typically uses the table name as the S3 prefix. Within it, Iceberg by default keeps the metadata files (table metadata, manifest lists and manifests, which carry the schema and partition information) under a metadata/ prefix and the data files under a data/ prefix.

By default the data files are stored in a directory structure that reflects the table partitioning, with partition values encoded in the directory names. Readers do not depend on this layout, because the manifests list every data file explicitly.

s3://bucket-name/table-name/data/date=YYYY-MM-DD/region=us-west-1/0001.parquet

s3://bucket-name/table-name/data/date=YYYY-MM-DD/region=us-west-1/0002.parquet
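To illustrate the naming convention in these paths, here is a small sketch that pulls the key=value segments out of an object URI. `partition_values` is a hypothetical helper and the concrete date is an assumed example value. An engine reading an Iceberg table takes partition values from the table metadata rather than from paths, so this is only a way to read the layout by eye.

```python
from urllib.parse import unquote, urlparse


def partition_values(uri: str) -> dict:
    """Extract key=value directory segments from an object path."""
    segments = urlparse(uri).path.strip("/").split("/")
    values = {}
    for segment in segments[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = unquote(value)
    return values


example = partition_values(
    "s3://bucket-name/table-name/date=2024-01-02/region=us-west-1/0001.parquet"
)
print(example)
```

Segments without an `=` are skipped, so the helper gives the same result if extra directory levels sit between the table name and the partition directories.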

Why a new table format – https://github.com/Netflix/iceberg

A hands-on look at Iceberg tables by Dremio is here.

A blog on the Adobe experience with Iceberg is here.

A blog on creating a real-time data warehouse with Flink and Iceberg – https://www.alibabacloud.com/blog/flink-%20-iceberg-how-to-construct-a-whole-scenario-real-time-data-warehouse_597824

What are the insights behind Iceberg that led to its rapid adoption? What were the old assumptions about data warehousing that were being challenged?

The big changes in data warehousing were mostly driven by four old assumptions breaking.

The first assumption was that storage and compute should be tightly coupled inside one database system. Classic warehouse systems were built around local or tightly managed storage attached to the query engine. As data volumes grew and cloud economics took over, this became too rigid: teams wanted cheap durable storage, elastic compute, and multiple clusters reading the same data. Dremel’s evolution into BigQuery explicitly highlights disaggregated storage and compute as a foundational design principle, and Snowflake’s core architecture is likewise a shared-data, multi-cluster design. Redshift’s later evolution also emphasizes tiered storage, auto-scaling, and cross-cluster sharing. This meant loose coupling between storage and compute.

The second assumption was that the database engine should own the data format and the only serious workload was SQL analytics. That held when warehouses were mainly fed by ETL and queried by BI tools. It weakened when the same data also needed to support data science, machine learning, streaming ingestion, and direct file access. The lakehouse literature describes this as the pressure to combine warehouse features with open direct-access formats such as Parquet and ORC, instead of forcing everything through a closed warehouse storage layer. So this meant loose coupling between the database backend and the application workload.

The third assumption was that catalog metadata was mostly about table definitions, not about exact evolving table state. In Hive-style systems, metadata was largely names, columns, partitions, locations, and storage descriptors. That worked until tables became huge, mutable, multi-engine, and compliance-sensitive. Then it was not enough to know what a table is called and where it lives; systems also needed to know exactly which files are live now, which rows are logically deleted, which schema version applies, and how to time-travel or roll back. Iceberg’s spec is a response to that pressure: it adds snapshots, manifest lists, manifests, delete files, partition specs, and hidden partitioning so that the table’s current state is metadata-defined rather than directory-defined.
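The difference between metadata-defined and directory-defined state can be sketched with a toy table. `Manifest`, `Snapshot` and `Table` are hypothetical classes that keep only the parts needed here (no manifest lists, delete files or partition specs). The point is that the set of live files is whatever the chosen snapshot says, regardless of what sits in the directory.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Manifest:
    data_files: Tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: Tuple[Manifest, ...]

    def live_files(self):
        return [f for m in self.manifests for f in m.data_files]


class Table:
    def __init__(self):
        self.snapshots: Dict[int, Snapshot] = {}
        self.current_snapshot_id: Optional[int] = None

    def append(self, files) -> int:
        """Commit a new snapshot that reuses old manifests and adds one more."""
        previous = ()
        if self.current_snapshot_id is not None:
            previous = self.snapshots[self.current_snapshot_id].manifests
        snapshot_id = len(self.snapshots) + 1
        manifests = previous + (Manifest(tuple(files)),)
        self.snapshots[snapshot_id] = Snapshot(snapshot_id, manifests)
        self.current_snapshot_id = snapshot_id
        return snapshot_id

    def scan(self, snapshot_id: Optional[int] = None):
        """List live files for the current snapshot, or for an older one."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        return [] if sid is None else self.snapshots[sid].live_files()

    def rollback(self, snapshot_id: int) -> None:
        self.current_snapshot_id = snapshot_id  # old snapshot becomes current


table = Table()
s1 = table.append(["a.parquet"])
s2 = table.append(["b.parquet"])

current_files = table.scan()
time_travel_files = table.scan(snapshot_id=s1)

table.rollback(s1)
after_rollback = table.scan()
print(current_files, time_travel_files, after_rollback)
```

After the rollback, `b.parquet` still exists in storage but is no longer part of the table, which a directory listing alone could not express.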

The fourth assumption was that append-oriented batch processing was enough. Once organizations wanted late-arriving data, updates, merges, GDPR deletions, streaming-to-analytics pipelines, and reproducible historical queries, append-only file piles became operationally awkward. Modern lakehouse systems therefore added row-level operations, snapshot isolation, and richer table metadata on top of object stores. Recent work on Iceberg-based row-level operations is explicitly about making those mutable operations practical at very large scale.
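Two common ways to delete rows from immutable files are to rewrite the affected files (copy-on-write) or to record the deleted positions separately and filter them out at read time (merge-on-read). The sketch below models both with plain Python structures. The function names, the in-memory "files" and the user values are assumptions for illustration, and real delete files carry more detail than a set of (path, position) pairs.

```python
def copy_on_write_delete(data_files, predicate):
    """Rewrite every file, dropping rows that match the predicate."""
    return {
        path: [row for row in rows if not predicate(row)]
        for path, rows in data_files.items()
    }


def find_position_deletes(data_files, predicate):
    """Record (file path, row position) for each row to delete."""
    return {
        (path, pos)
        for path, rows in data_files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    }


def read_with_position_deletes(data_files, position_deletes):
    """Merge-on-read: data files stay untouched, deletes apply while reading."""
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (path, pos) not in position_deletes:
                yield row


data_files = {
    "a.parquet": [{"user": "u1"}, {"user": "u2"}],
    "b.parquet": [{"user": "u1"}, {"user": "u3"}],
}


def is_u1(row):
    # For example, a deletion request for user u1.
    return row["user"] == "u1"


rewritten = copy_on_write_delete(data_files, is_u1)
deletes = find_position_deletes(data_files, is_u1)
visible = list(read_with_position_deletes(data_files, deletes))
print(rewritten, sorted(deletes), visible)
```

Both approaches expose the same rows to readers. The trade-off is where the cost lands: copy-on-write pays at write time by rewriting files, while merge-on-read pays at read time by applying the recorded deletes.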

Here’s a metadata explorer to compare the metadata stored in iceberg and hive formats – metadata explorer app.

Machine-generated data, such as clickstreams or database update streams, consists of many rows, each of which has three parts:

timestamp
text columns or attributes, called dimensions
numerical values such as counts of hits, words, characters etc, called metrics

The desire to rapidly aggregate over such data with a low latency gave rise to Druid.
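The kind of aggregation involved can be sketched as follows: truncate each timestamp to a time bucket, group by the bucket plus the dimension values, and sum the metrics. Druid can do this kind of pre-aggregation at ingestion time (it calls it rollup). The code is a toy in-memory version, with `rollup`, the event layout and the sample values all assumed for illustration, and timestamps given as epoch seconds.

```python
from collections import defaultdict


def rollup(events, granularity_seconds=60):
    """Sum metrics per (time bucket, dimension values)."""
    totals = defaultdict(lambda: defaultdict(int))
    for event in events:
        bucket = event["timestamp"] // granularity_seconds * granularity_seconds
        key = (bucket, tuple(sorted(event["dimensions"].items())))
        for metric, value in event["metrics"].items():
            totals[key][metric] += value
    return {key: dict(metrics) for key, metrics in totals.items()}


home_us = {"page": "/home", "country": "US"}
events = [
    {"timestamp": 120, "dimensions": home_us, "metrics": {"hits": 1, "chars": 100}},
    {"timestamp": 150, "dimensions": home_us, "metrics": {"hits": 1, "chars": 250}},
    {"timestamp": 185, "dimensions": home_us, "metrics": {"hits": 1, "chars": 50}},
]

rolled = rollup(events)
for key, metrics in sorted(rolled.items()):
    print(key, metrics)
```

Three input rows collapse into two stored rows, one per minute. Queries that aggregate by minute or coarser then touch far fewer rows than the raw stream contained.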

The data is append-heavy, so both ingesting it and querying it at low latency are hard problems.

Druid has two subsystems –

a write-optimized subsystem in the real-time nodes
a read-optimized subsystem in the historical nodes

The data is stored in S3 or HDFS in a column-oriented format.

A good explanatory article that goes into the Druid internal architecture – https://towardsdatascience.com/apache-druid-part-1-a-scalable-timeseries-olap-database-system-af8c18fc766d

Druid is different from Flink and Spark Streaming in that it is not a stream processing system. Flink can apply real-time transformations to the data, which can then be ingested into Druid via Kafka to power real-time dashboards.

Apache Iceberg – what it does and what are the insights behind it

Apache Druid – horizontally scalable time series database