
Dov/iceberg docs#34781

Open
DAlperin wants to merge 9 commits intoMaterializeInc:mainfrom
DAlperin:dov/iceberg-docs

Conversation


@DAlperin DAlperin commented Jan 21, 2026

https://gist.github.com/DAlperin/0765693e5afbda68c7a0bb05f63e00eb

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.


## Syntax

```mzsql

Similar to how Kafka has a data/examples/create_sink_kafka.yml file and uses the include-syntax shortcode, could we create a create_sink_iceberg.yml and use include-syntax?
This way, this file can do {{% include-syntax ... %}} while
doc/user/content/sql/create-sink/_index.md can use {{% include-example ... %}}.

Also, in the yml file, you can include additional blurbs in there for single-sourcing.
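For illustration, a sketch of what such a file might contain. The filename follows the Kafka example, but the field names and the CREATE SINK shape below are assumptions pieced together from fragments in this diff, not a confirmed schema:

```yaml
# data/examples/create_sink_iceberg.yml (hypothetical sketch)
# Field names mirror the assumed shape of create_sink_kafka.yml.
blurb: |
  Iceberg sinks continuously stream changes from Materialize to an
  Iceberg table hosted on AWS S3 Tables.
syntax: |
  CREATE SINK <sink_name>
    FROM <item_name>
    INTO ICEBERG CATALOG CONNECTION <catalog_connection> (
      NAMESPACE = '<namespace>',
      TABLE = '<table>'
    )
    USING AWS CONNECTION <aws_connection>
    KEY (<key_columns>)
    COMMIT INTERVAL '<interval>';
```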

| **NOT ENFORCED** | Optional. Disable validation of key uniqueness. Use only when you have outside knowledge that the key is unique. |
| **COMMIT INTERVAL** `'<interval>'` | **Required.** How frequently to commit snapshots to Iceberg (e.g., `'10s'`, `'1m'`). |
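As a concrete illustration, a sink using both options might be declared like this. The sink, view, and connection names are placeholders, and the overall INTO clause shape is an assumption assembled from fragments elsewhere in this diff:

```mzsql
-- Hypothetical example: names and the INTO clause shape are assumptions.
CREATE SINK my_iceberg_sink
  FROM my_materialized_view
  INTO ICEBERG CATALOG CONNECTION iceberg_catalog (
    NAMESPACE = 'my_namespace',
    TABLE = 'my_table'
  )
  USING AWS CONNECTION aws_connection
  KEY (id) NOT ENFORCED
  COMMIT INTERVAL '1m';
```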

## How Iceberg sinks work

I would just make this ## Details

| `<sink_name>` | The name for the sink. |
| **IF NOT EXISTS** | Optional. Do not throw an error if a sink with the same name already exists. |
| **IN CLUSTER** `<cluster_name>` | Optional. The [cluster](/sql/create-cluster) to maintain this sink. |
| `<item_name>` | The name of the source, table, or materialized view to sink. |

I know this is the same as in Kafka sinks (and that content hasn't been vetted), but do you know off the top of your head whether by "source" we actually mean the subsources?

Member Author

Maybe? Probably? I'll investigate


## How Iceberg sinks work

Iceberg sinks continuously stream changes from your source relation to an

instead of "source relation", which is a little ambiguous because of "source" in the general sense and "source" in our Materialize source sense, maybe "from Materialize".


Iceberg sinks continuously stream changes from your source relation to an
Iceberg table. If the table doesn't exist, Materialize automatically creates it
with a schema matching your source.

Do we need this sentence? since we mention it in the table above?

relation. Materialize uses these columns to generate equality delete files when
rows are updated or deleted.

If Materialize cannot validate that your key is unique, you'll receive an error.

Mention first that Materialize performs a validation; then follow with the "If Materialize cannot verify ..." sentence.

you may see commit conflict errors. Materialize will automatically retry, but
if conflicts persist, ensure no other writers are modifying the same table.

## Reference

The sections here can go into the actual create sink reference page below under details.

- **Record types**: Composite/record types are not currently supported. Use
scalar types or flatten your data structure.

## Troubleshooting

This probably can also go into yaml and be included here and in the create sink reference page below.

- **Record types**: Composite/record types are not supported. Use scalar types
or flatten your data structure.

## Technical reference

Oh ... mentioned above ... but would move (or also repeat) various content from the guide into here.


#### Syntax {#iceberg-catalog-syntax}

```mzsql

ditto about syntax and options in a yaml file. There's a data/examples/create_connection.yml file.

{{< public-preview />}}

This guide walks you through the steps required to export results from
Materialize to [Apache Iceberg](https://iceberg.apache.org/) tables. Iceberg

Let's specify that this is for S3 tables only:

This guide walks you through the steps required to export results from
Materialize to Apache Iceberg tables, hosted on AWS S3 Tables


This guide walks you through the steps required to export results from
Materialize to [Apache Iceberg](https://iceberg.apache.org/) tables. Iceberg
sinks are useful for maintaining a continuously updated analytical table that

Let's add a bit more here:

Apache Iceberg is an open table format for large-scale analytics datasets that brings reliable, performant ACID transactions, schema evolution, and time travel to data lakes. It gives you data warehouse-like reliability, with the cost advantages of object storage.

Amazon S3 Tables is an AWS feature that provides fully managed Apache Iceberg tables as a native S3 storage type, eliminating the need to manage separate metadata catalogs or table maintenance operations. It automatically handles compaction & snapshot management.


Iceberg sinks allow you to deliver analytical data from Materialize into an Iceberg table, hosted on AWS S3 Tables. As data changes in Materialize, your Iceberg tables are automatically kept up to date.


Q: Am curious as to why we want to pop blurbs that seem more marketing for Apache Iceberg and Amazon S3 tables into our docs.

API. Add the following to your IAM policy:

```json
{

Let's make all the IAM policies a single statement so that it is easier for the user to copy paste
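For example, the separate statements could be merged into one. The action list below is illustrative, combining the S3 actions that appear elsewhere on this page; substitute the final vetted list:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket>",
        "arn:aws:s3:::<bucket>/<prefix>/*"
      ]
    }
  ]
}
```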


## Step 2. Create connections

Iceberg sinks require two connections:

Nit: let's describe what a connection is.

Connections allow Materialize to authenticate to an external system.
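A quick way for readers to confirm what they have created so far (SHOW CONNECTIONS is existing Materialize syntax):

```mzsql
SHOW CONNECTIONS;
```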

}
```

You'll update the external ID after creating the AWS connection in Materialize.

let's make it clear that the user needs to keep track of their ARN: Once you have created the IAM role, you should be able to get the ARN from the AWS console. You'll use the ARN in the next step.

2. An **Iceberg catalog connection** to interact with the Iceberg catalog

### Create an AWS connection


Nit: Fill in the ARN from step 1

SELECT external_id
FROM mz_internal.mz_aws_connections awsc
JOIN mz_connections c ON awsc.id = c.id
WHERE c.name = 'aws_connection';
@maheshwarip maheshwarip Jan 23, 2026

let's be more descriptive here:

Run the query below to fetch the external_id. Once you have the external_id, go back to the trust policy for the IAM role created in step 1. Add the external_id to the policy, in the field labeled sts:ExternalId. At the end of this step, your IAM trust policy should look like this:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::664411391173:role/MaterializeConnection"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "mz_5ef0f19e-3172-4b35-a7f8-19862d214677_u191"
                }
            }
        }
    ]
}
```

);
```

Replace `<region>` with your AWS region (e.g., `us-east-1`) and `<table-bucket-name>`

We could also just tell the user to copy the ARN for their table bucket right?

{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::664411391173:role/MaterializeConnection"

Just double checking - is this the right AWS account? Would this differ by region, i.e. for us-east-1 vs. us-west?

JOIN mz_connections c ON awsc.id = c.id
WHERE c.name = 'aws_connection';
```


Let's add a step here to validate the connection
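For example, using Materialize's connection validation command:

```mzsql
VALIDATE CONNECTION aws_connection;
```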


```mzsql
CREATE CONNECTION aws_connection
TO AWS (ASSUME ROLE ARN = 'arn:aws:iam::<account-id>:role/<role>');

We should be explicit about setting the region / link out to our usual connection creation steps

```

You'll update the external ID after creating the AWS connection in Materialize.


Missing: we need to tell the user to attach the permissions policy they created before
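e.g., with the AWS CLI. The role and policy names below are placeholders for whatever the user created in step 1:

```shell
# Attach the permissions policy created earlier to the IAM role.
aws iam attach-role-policy \
  --role-name <role> \
  --policy-arn "arn:aws:iam::<account-id>:policy/<policy-name>"
```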


- Ensure you have access to an AWS account with permissions to create and manage
IAM policies and roles.
- Ensure you have an AWS S3 Tables bucket configured in your AWS account.

Also:

  • Ensure that you have created a namespace

NAMESPACE = 'my_namespace',
TABLE = 'my_table'
)
USING AWS CONNECTION aws_connection

missing: envelope upsert
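i.e., if Iceberg sinks follow the Kafka sink convention, the statement would carry an ENVELOPE clause. The exact placement below is an assumption based on this diff's fragments:

```mzsql
-- Hypothetical: the ENVELOPE UPSERT placement is an assumption.
  USING AWS CONNECTION aws_connection
  ENVELOPE UPSERT;
```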

| `NAMESPACE` | The Iceberg namespace (database) containing the table. |
| `TABLE` | The name of the Iceberg table to write to. |
| `KEY` | **Required.** The columns that uniquely identify rows. Used to track updates and deletes. |
| `COMMIT INTERVAL` | **Required.** How frequently to commit snapshots to Iceberg. See [Commit interval tradeoffs](#commit-interval-tradeoffs) below. |

Also missing envelope here


we should document what the min & max commit intervals are

{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::664411391173:role/MaterializeConnection"

This might need to change for self-managed, since users wouldn't go through our cloud acct


Also: how would this work for our emulator?

| Shorter commit interval | Longer commit interval |
|---------------------------------|-------------------------------|
| Lower latency - data visible sooner | Higher latency - data takes longer to appear |
| More small files - can degrade query performance | Fewer, larger files - better query performance |
| Higher catalog overhead | Lower catalog overhead |

Also: s3 write costs!

- **Partitioning**: Materialize creates unpartitioned tables. Partitioned tables
are not yet supported.
- **Record types**: Composite/record types are not supported. Use scalar types
or flatten your data structure.

Let's add the limitations that the users have to deliver data to the same region

results.
{{< /warning >}}

## Limitations

Should we add a section around best practices, recommending that Materialize is the only writer to the iceberg table? And that users have only 1 sink to the iceberg table in question @DAlperin ?

"s3:PutObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::<bucket>/<prefix>/*"

What should the `<prefix>` be? Just the namespace? What if I have multiple namespaces? (I assume it's one per sink.)

@kay-kim kay-kim Feb 4, 2026

this will go away as this isn't needed. Only the following is needed.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3tables:*",
            "Resource": "*"
        }
    ]
}
```

Comment on lines 130 to 131
The AWS account ID `664411391173` is the Materialize AWS account. This may
differ for self-managed deployments.

Why do we say "may differ"? It's definitely different for customers I think.

"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "PENDING"
@def- def- Feb 4, 2026

What is the meaning of PENDING here? Edit: I see, I should have read further! (or we could explain earlier that it will be filled later)


(have a local wip patch to this draft).


### Create an IAM role

Create an [IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html)

What kind of role do I need? I see a bunch of trusted entity types to choose from.


For this tutorial "Custom trust policy" (have a local wip patch to this draft).

"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::<bucket>",

Should probably mention that you have to replace `<bucket>` and `<prefix>`.


Hey Dennis --
I have a local patch on my machine ... it's still a WIP, but this will be updated to:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3tables:*",
            "Resource": "*"
        }
    ]
}
```

That is, that statement allows all the specific actions.


I pushed up my WIP patch ... I'm still working on it ... but figured current work can still help people as they test.

@DAlperin DAlperin marked this pull request as ready for review February 13, 2026 21:25
@DAlperin DAlperin requested a review from a team as a code owner February 13, 2026 21:25
