Skip to content

Add Apache Iceberg load and export support#236

Open
mjschleich wants to merge 35 commits intomainfrom
mjs-add-iceberg-export
Open

Add Apache Iceberg load and export support#236
mjschleich wants to merge 35 commits intomainfrom
mjs-add-iceberg-export

Conversation

@mjschleich
Copy link
Copy Markdown

@mjschleich mjschleich commented Mar 19, 2026

Summary

  • Add ExportIcebergConfig message to transactions.proto with fields: locator, config (warehouse/token/credential), schema, prefix, target_file_size_bytes, and compression
  • For load support, we added IcebergData node, reusing GNFColumns from CSVData.
  • Extend the Export oneof to support Iceberg alongside CSV
  • Add grammar rules, helper functions, and test coverage for the new export type across all three SDKs (Python, Go, Julia)

Test plan

  • make protobuf — regenerated bindings for Python, Go, Julia
  • make parsers / make printers — regenerated from grammar
  • make update-bins / make update-snapshots — test fixtures updated
  • make test — all tests pass (Python 686, Julia 31509, Go ok)

Add ExportIcebergConfig to the Export oneof in transactions.proto with
fields: catalog_uri, namespace, table_name, catalog_properties
(warehouse, token, credential), schema, prefix, target_file_size_bytes,
and compression. Grammar rules, parsers, printers, and tests are
included for all three SDKs (Python, Go, Julia).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mjschleich mjschleich marked this pull request as draft March 19, 2026 21:16
@vustef vustef changed the title Add Apache Iceberg export support Add Apache Iceberg load and export support Mar 26, 2026
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reviewers, I'm not sure if this file is supposed to be generated...

# or appear in hashed structures (e.g. Dict keys, caches).
Base.:(==)(a::IcebergConfig, b::IcebergConfig) =
a.catalog_uri == b.catalog_uri && a.scope == b.scope && a.properties == b.properties
Base.hash(a::IcebergConfig, h::UInt) = hash(a.properties, hash(a.scope, hash(a.catalog_uri, h)))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reviewers: I omitted auth_properties from hash and ==, thinking they shouldn't define object's identity. But this then influences that tests don't take this into account either. Not sure if this is fine, nor required.

@vustef vustef marked this pull request as ready for review March 26, 2026 16:00
@vustef vustef requested a review from gbrgr March 27, 2026 09:35
Copy link
Copy Markdown

@gbrgr gbrgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Copy Markdown
Author

@mjschleich mjschleich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this looks good to me, just noticed two small issues

Comment on lines +1276 to +1277
deconstruct if builtin.has_proto_field($$, 'csv_config'):
$3: transactions.ExportCSVConfig = $$.csv_config
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
deconstruct if builtin.has_proto_field($$, 'csv_config'):
$3: transactions.ExportCSVConfig = $$.csv_config
deconstruct: $3: transactions.ExportCSVConfig = $$.csv_config

I don't think this is needed? 🤔

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems needed, this pattern is used e.g. in :

data
    : edb
      construct: $$ = logic.Data(edb=$1)
      deconstruct if builtin.has_proto_field($$, 'edb'):
        $1: logic.EDB = $$.edb
    | betree_relation
      construct: $$ = logic.Data(betree_relation=$1)
      deconstruct if builtin.has_proto_field($$, 'betree_relation'):
        $1: logic.BeTreeRelation = $$.betree_relation
    | csv_data
      construct: $$ = logic.Data(csv_data=$1)
      deconstruct if builtin.has_proto_field($$, 'csv_data'):
        $1: logic.CSVData = $$.csv_data
    | iceberg_data
      construct: $$ = logic.Data(iceberg_data=$1)
      deconstruct if builtin.has_proto_field($$, 'iceberg_data'):
        $1: logic.IcebergData = $$.iceberg_data

And claude says:
The pretty printer needs to know which variant is active. Without guards, when
printing an iceberg export, it first tries the CSV alternative — $$.csv_config
returns an empty/default ExportCSVConfig, which then fails downstream when trying to
print its sub-fields.

The has_proto_field guard checks which oneof variant is actually set, so only the
matching alternative runs. This is the same pattern used elsewhere for oneofs (e.g.,
Data, Value, Term).

(
source_table_def
0x585c5a1121bd88784c1378aa8837195c)
(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like how this is pretty-printed. Claude suggests that with the current pretty printer the only fix is to extract these into non-terminals, but I'm not sure if this is the right way...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants