Secure Machinery

On the evolution of security and intelligent machinery

Apache Iceberg – what it does and what are the insights behind it

Written by Ruchir Tewari

What does Apache Iceberg do? It addresses challenges in the Hadoop/Hive ecosystem around maintaining very large data files in data warehouses. These challenges include data updates such as schema changes and deletes, multiple concurrent writers and readers, slow performance for large tables, and lack of integration with cloud-based storage.

Iceberg:

  • It manages large, slow-changing tabular data and gives an SQL interface to the data so that it can be queried efficiently.
  • It breaks a table into partitions and stores the data files in an object store such as S3. Partitions can be filtered based on the partition key(s). The partitioning is “invisible” (Iceberg's documentation calls it hidden partitioning), meaning it is done by the system for you, without exposing the details to the client.
  • It separates metadata management from the data. Metadata is not stored in the data files.
  • It separates the table schema from the data. A change of column name does not affect the data files (Schema Evolution).
  • It allows accessing data as it existed at a specific point in time (Time Travel). This feature is useful for auditing, debugging and reproducing issues that occurred in the past. Time travel is implemented using “snapshot isolation”, which allows multiple versions of the same table to exist at the same time. (Copy on Write is used in the implementation.)
  • It provides ACID-compliant transactions for data modifications and snapshot isolation for queries, which help ensure consistency and correctness of data.
  • It does all this through a lightweight design with minimal coordination.
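To make the partitioning bullet concrete, here is a minimal toy sketch in Python. It is not Iceberg's API: `ToyTable`, `day_transform` and the `event_time` column are hypothetical names invented for illustration. The idea it shows is that the partition value is derived from a regular column by the table itself, so a reader filters on the column and never names the partition.

```python
from datetime import datetime


def day_transform(ts: datetime) -> str:
    """Derive a partition value from a timestamp column (a 'day' style transform)."""
    return ts.strftime("%Y-%m-%d")


class ToyTable:
    """Toy model only: rows are grouped by a value derived from event_time."""

    def __init__(self):
        self.partitions = {}  # partition value -> list of rows

    def append(self, row: dict) -> None:
        # The writer never supplies a partition column; the table derives it.
        key = day_transform(row["event_time"])
        self.partitions.setdefault(key, []).append(row)

    def scan(self, start: datetime, end: datetime):
        """Filter on event_time only; partition pruning happens internally."""
        lo, hi = day_transform(start), day_transform(end)
        # ISO dates sort lexicographically, so string comparison is enough here.
        scanned = sorted(k for k in self.partitions if lo <= k <= hi)
        rows = [r for k in scanned for r in self.partitions[k]
                if start <= r["event_time"] <= end]
        return rows, scanned


table = ToyTable()
table.append({"id": 1, "event_time": datetime(2022, 11, 5, 23, 30)})
table.append({"id": 2, "event_time": datetime(2022, 11, 6, 8, 0)})
table.append({"id": 3, "event_time": datetime(2022, 11, 7, 9, 15)})

rows, scanned = table.scan(datetime(2022, 11, 6, 0, 0), datetime(2022, 11, 6, 23, 59))
print([r["id"] for r in rows], scanned)
```

The query touches only one of the three partitions even though the caller never mentioned a partition column, which is the property the bullet describes.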

A migration from HDFS with the Hive Metastore (HMS) to S3 with Iceberg is typically done with a combination of Kafka for replication and Spark for the data format transformation.

The fundamental construct in Iceberg is called a table. The word is confusing, because this is not a database table defined just by a schema with column names, types, foreign keys and so on. An Iceberg table is a different kind of entity, one that tracks schemas, snapshots and manifests. It is better described as a “Versioned File Set With Schema and Snapshot Metadata”. This table entity is versioned: it has a current state and a history of its past states. The state is stored in table metadata files, and each metadata change creates a new table metadata file.

The table metadata file has a list of schemas and a current schema id. Each schema has a set of fields (name, type). Fields can be added, deleted or renamed, can be tagged as required or optional, and can be nested. The identity of a field exists beyond its name and type, in a field_id which is essentially immutable: if a field is dropped and re-added, it is treated as a new field with a new id.

In addition to the schemas, an Iceberg table tracks snapshots and manifests, which record which data currently makes up the table. A snapshot answers ‘what is the table at this moment’: it is the state of the entire table. A manifest is a metadata file about a subset of the files in a snapshot.
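The field_id behaviour described above can be sketched with a small toy model. This is an illustration under assumptions, not Iceberg's implementation: `ToySchema`, `Field` and their methods are hypothetical names, and ids are simply handed out by a counter.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Field:
    field_id: int          # identity of the column; never reused
    name: str              # display name; may change
    type: str
    required: bool = False


@dataclass
class ToySchema:
    """Toy model: columns are identified by field_id, not by name."""
    fields: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def add(self, name: str, type_: str, required: bool = False) -> int:
        new_field = Field(next(self._ids), name, type_, required)
        self.fields.append(new_field)
        return new_field.field_id

    def rename(self, old: str, new: str) -> int:
        for f in self.fields:
            if f.name == old:
                f.name = new       # only the name changes, the id is kept
                return f.field_id
        raise KeyError(old)

    def drop(self, name: str) -> None:
        self.fields = [f for f in self.fields if f.name != name]

    def id_of(self, name: str) -> int:
        return next(f.field_id for f in self.fields if f.name == name)


schema = ToySchema()
user_id = schema.add("user", "long", required=True)
email_id = schema.add("email", "string")

schema.rename("user", "user_id")           # same column, new name
schema.drop("email")
new_email_id = schema.add("email", "string")  # same name, different column

print(schema.id_of("user_id") == user_id, new_email_id != email_id)
```

Because data files would refer to columns by id in this model, a rename needs no rewrite of data, and a dropped-then-re-added column cannot accidentally resurrect old values.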

Figure. The Iceberg table format is used by multiple engines and is capable of writing to multiple storage types. (source)

Ryan Blue’s discussion on the rationale for the design is here and a presentation with performance improvements is at https://conferences.oreilly.com/strata/strata-ny-2018/cdn.oreillystatic.com/en/assets/1/event/278/Introducing%20Iceberg_%20Tables%20designed%20for%20object%20stores%20Presentation.pdf

“By building support for Iceberg, data warehouses can skip the query layer and share data directly. Iceberg was built on the assumption that there is no single query layer. Instead, many different processes all use the same underlying data and coordinate through the table format along with a very lightweight catalog. Iceberg enables direct data access needed by all of these use cases and, uniquely, does it without compromising the SQL behavior of data warehouses.”

The client is a Java JAR file which can be embedded.

How does Iceberg store files in S3?

The top level directory contains the table’s metadata files including the schema and partition information. The metadata files are stored in S3 using the table name as the S3 prefix.

The data files are stored in a directory structure that reflects the table partitioning. Partition values are encoded in the directory name.

s3://bucket-name/table-name/date=YYYY-MM-DD/region=us-west-1/0001.parquet

s3://bucket-name/table-name/date=YYYY-MM-DD/region=us-west-1/0002.parquet
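As a small illustration of the layout above, the following Python sketch builds and parses keys of that shape. The helper names `partition_key` and `parse_partition` are hypothetical, and the date value is an illustrative stand-in for the YYYY-MM-DD placeholder. Note that this only models the path encoding; Iceberg itself finds data files through its manifests rather than by listing directories.

```python
def partition_key(table_prefix: str, partition: dict, file_name: str) -> str:
    """Encode partition values as key=value path segments under the table prefix."""
    segments = [f"{column}={value}" for column, value in partition.items()]
    return "/".join([table_prefix.rstrip("/"), *segments, file_name])


def parse_partition(key: str) -> dict:
    """Recover partition values from the key=value segments of a path."""
    return dict(segment.split("=", 1) for segment in key.split("/") if "=" in segment)


key = partition_key(
    "s3://bucket-name/table-name",
    {"date": "2022-11-06", "region": "us-west-1"},  # illustrative values
    "0001.parquet",
)
print(key)
print(parse_partition(key))
```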

Why a new table format – https://github.com/Netflix/iceberg

A hands-on look at Iceberg tables by Dremio is here.

A blog on the Adobe experience with Iceberg is here.

A blog on creating a real-time data warehouse with Flink and Iceberg – https://www.alibabacloud.com/blog/flink-%20-iceberg-how-to-construct-a-whole-scenario-real-time-data-warehouse_597824

What are the insights behind Iceberg that led to its rapid adoption? What were the old assumptions about data warehousing that were being challenged?

The big changes in data warehousing were mostly driven by four old assumptions breaking.

The first assumption was that storage and compute should be tightly coupled inside one database system. Classic warehouse systems were built around local or tightly managed storage attached to the query engine. As data volumes grew and cloud economics took over, this became too rigid: teams wanted cheap durable storage, elastic compute, and multiple clusters reading the same data. Dremel’s evolution into BigQuery explicitly highlights disaggregated storage and compute as a foundational design principle, and Snowflake’s core architecture is likewise a shared-data, multi-cluster design. Redshift’s later evolution also emphasizes tiered storage, auto-scaling, and cross-cluster sharing. This meant loose coupling between storage and compute.

The second assumption was that the database engine should own the data format and the only serious workload was SQL analytics. That held when warehouses were mainly fed by ETL and queried by BI tools. It weakened when the same data also needed to support data science, machine learning, streaming ingestion, and direct file access. The lakehouse literature describes this as the pressure to combine warehouse features with open direct-access formats such as Parquet and ORC, instead of forcing everything through a closed warehouse storage layer. So this meant loose coupling between the database backend and the application workload.

The third assumption was that catalog metadata was mostly about table definitions, not about exact evolving table state. In Hive-style systems, metadata was largely names, columns, partitions, locations, and storage descriptors. That worked until tables became huge, mutable, multi-engine, and compliance-sensitive. Then it was not enough to know what a table is called and where it lives; systems also needed to know exactly which files are live now, which rows are logically deleted, which schema version applies, and how to time-travel or roll back. Iceberg’s spec is a response to that pressure: it adds snapshots, manifest lists, manifests, delete files, partition specs, and hidden partitioning so that the table’s current state is metadata-defined rather than directory-defined. 
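The difference between metadata-defined and directory-defined state can be sketched with a toy model in Python. Everything here is an assumption made for illustration (`ToyTableMetadata`, `Snapshot`, `live_files`, the file names and timestamps); real Iceberg metadata carries far more detail, including a manifest list between the snapshot and its manifests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    manifests: tuple  # each manifest is a tuple of data file paths


class ToyTableMetadata:
    """Toy model: the table is whatever the chosen snapshot says it is."""

    def __init__(self):
        self.snapshots = []

    def commit(self, timestamp_ms: int, manifests) -> Snapshot:
        # Every change appends a new snapshot; old ones stay readable.
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms,
                        tuple(tuple(m) for m in manifests))
        self.snapshots.append(snap)
        return snap

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def as_of(self, timestamp_ms: int) -> Snapshot:
        """Time travel: the latest snapshot committed at or before the timestamp."""
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not candidates:
            raise LookupError("no snapshot at or before that time")
        return candidates[-1]


def live_files(snapshot: Snapshot) -> list:
    """The files that make up the table in this snapshot."""
    return sorted(path for manifest in snapshot.manifests for path in manifest)


meta = ToyTableMetadata()
meta.commit(1000, [["a.parquet", "b.parquet"]])
# b.parquet is rewritten as c.parquet; b still exists in storage but is no longer live.
meta.commit(2000, [["a.parquet"], ["c.parquet"]])

print(live_files(meta.current()))
print(live_files(meta.as_of(1500)))
```

A directory listing would show all three files at once; only the snapshot says which ones are the table now and which ones were the table earlier.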

The fourth assumption was that append-oriented batch processing was enough. Once organizations wanted late-arriving data, updates, merges, GDPR deletions, streaming-to-analytics pipelines, and reproducible historical queries, append-only file piles became operationally awkward. Modern lakehouse systems therefore added row-level operations, snapshot isolation, and richer table metadata on top of object stores. Recent work on Iceberg-based row-level operations is explicitly about making those mutable operations practical at very large scale.
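One way row-level deletes can work on immutable files is to record deleted row positions separately and apply them at read time. The Python sketch below is a simplified illustration of that idea only; the function name, the in-memory dictionaries and the file names are hypothetical, and it is not meant to describe how any particular engine implements it.

```python
def read_with_position_deletes(data_files: dict, position_deletes: dict):
    """Yield rows from immutable data files, skipping positions marked as deleted.

    data_files:       file path -> list of rows (the files are never rewritten)
    position_deletes: file path -> set of row positions that are logically deleted
    """
    for path, rows in data_files.items():
        deleted = position_deletes.get(path, set())
        for position, row in enumerate(rows):
            if position not in deleted:
                yield row


data_files = {
    "0001.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
    "0002.parquet": [{"id": 4}],
}
# Deleting id=2 adds a small delete record instead of rewriting 0001.parquet.
position_deletes = {"0001.parquet": {1}}

visible = [row["id"] for row in read_with_position_deletes(data_files, position_deletes)]
print(visible)
```

The trade-off is that writes stay cheap while reads do a little extra work, which is the opposite of the copy-on-write approach where the affected file is rewritten at delete time.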

Here’s a metadata explorer to compare the metadata stored in iceberg and hive formats – metadata explorer app.

November 6, 2022 · June 9, 2026 · Posted in big data, data governance, ml · Tagged data warehouse, iceberg, metadata