|
|
||
| ## Syntax | ||
|
|
||
| ```mzsql |
There was a problem hiding this comment.
Similar to how Kafka has a data/examples/create_sink_kafka.yml file and uses the include-syntax shortcode, could we create a create_sink_iceberg.yml and use the include-syntax?
This way, this file can do {{% include-syntax ... %}} while the
doc/user/content/sql/create-sink/_index.md can use {{% include-example ...%}}
Also, in the yml file, you can include additional blurbs in there for single-sourcing.
| | **NOT ENFORCED** | Optional. Disable validation of key uniqueness. Use only when you have outside knowledge that the key is unique. | | ||
| | **COMMIT INTERVAL** `'<interval>'` | **Required.** How frequently to commit snapshots to Iceberg (e.g., `'10s'`, `'1m'`). | | ||
|
|
||
| ## How Iceberg sinks work |
There was a problem hiding this comment.
I would just make this ## Details
| | `<sink_name>` | The name for the sink. | | ||
| | **IF NOT EXISTS** | Optional. Do not throw an error if a sink with the same name already exists. | | ||
| | **IN CLUSTER** `<cluster_name>` | Optional. The [cluster](/sql/create-cluster) to maintain this sink. | | ||
| | `<item_name>` | The name of the source, table, or materialized view to sink. | |
There was a problem hiding this comment.
I know this is the same as in Kafka sinks ... (and that content hasn't been vetted) ... but do you know off the top of your head whether by "source" ... whether we actually mean the subsources?
There was a problem hiding this comment.
Maybe? Probably? I'll investigate
|
|
||
| ## How Iceberg sinks work | ||
|
|
||
| Iceberg sinks continuously stream changes from your source relation to an |
There was a problem hiding this comment.
instead of "source relation", which is a little ambiguous because of "source" in the general sense and "source" in our Materialize source sense, maybe "from Materialize".
|
|
||
| Iceberg sinks continuously stream changes from your source relation to an | ||
| Iceberg table. If the table doesn't exist, Materialize automatically creates it | ||
| with a schema matching your source. |
There was a problem hiding this comment.
Do we need this sentence? since we mention it in the table above?
| relation. Materialize uses these columns to generate equality delete files when | ||
| rows are updated or deleted. | ||
|
|
||
| If Materialize cannot validate that your key is unique, you'll receive an error. |
There was a problem hiding this comment.
Mention that Materialize does a validation .... then, the If Materialize cannot verify ...
| you may see commit conflict errors. Materialize will automatically retry, but | ||
| if conflicts persist, ensure no other writers are modifying the same table. | ||
|
|
||
| ## Reference |
There was a problem hiding this comment.
The sections here can go into the actual create sink reference page below under details.
| - **Record types**: Composite/record types are not currently supported. Use | ||
| scalar types or flatten your data structure. | ||
|
|
||
| ## Troubleshooting |
There was a problem hiding this comment.
This probably can also go into yaml and be included here and in the create sink reference page below.
| - **Record types**: Composite/record types are not supported. Use scalar types | ||
| or flatten your data structure. | ||
|
|
||
| ## Technical reference |
There was a problem hiding this comment.
Oh ... mentioned above ... but would move (or also repeat) various content from the guide into here
|
|
||
| #### Syntax {#iceberg-catalog-syntax} | ||
|
|
||
| ```mzsql |
There was a problem hiding this comment.
ditto about syntax and options in a yaml file. There's a data/examples/create_connection.yml file.
1f68b39 to 1652830
| {{< public-preview />}} | ||
|
|
||
| This guide walks you through the steps required to export results from | ||
| Materialize to [Apache Iceberg](https://iceberg.apache.org/) tables. Iceberg |
There was a problem hiding this comment.
Let's specify that this is for S3 tables only:
This guide walks you through the steps required to export results from
Materialize to Apache Iceberg tables, hosted on AWS S3 Tables
|
|
||
| This guide walks you through the steps required to export results from | ||
| Materialize to [Apache Iceberg](https://iceberg.apache.org/) tables. Iceberg | ||
| sinks are useful for maintaining a continuously updated analytical table that |
There was a problem hiding this comment.
Let's add a bit more here:
Apache Iceberg is an open table format for large-scale analytics datasets that brings reliable, performant ACID transactions, schema evolution, and time travel to data lakes. It gives you data warehouse-like reliability, with the cost advantages of object storage.
Amazon S3 Tables is an AWS feature that provides fully managed Apache Iceberg tables as a native S3 storage type, eliminating the need to manage separate metadata catalogs or table maintenance operations. It automatically handles compaction & snapshot management.
There was a problem hiding this comment.
Iceberg sinks allow you to deliver analytical data from Materialize into an Iceberg table, hosted on AWS S3 Tables. As data changes in Materialize, your Iceberg tables are automatically kept up to date.
There was a problem hiding this comment.
Q: Am curious as to why we want to pop blurbs that seem more marketing for Apache Iceberg and Amazon S3 tables into our docs.
| API. Add the following to your IAM policy: | ||
|
|
||
| ```json | ||
| { |
There was a problem hiding this comment.
Let's make all the IAM policies a single statement so that it is easier for the user to copy paste
|
|
||
| ## Step 2. Create connections | ||
|
|
||
| Iceberg sinks require two connections: |
There was a problem hiding this comment.
Nit: let's describe what a connection is.
Connections allow Materialize to authenticate to an external system.
| } | ||
| ``` | ||
|
|
||
| You'll update the external ID after creating the AWS connection in Materialize. |
There was a problem hiding this comment.
let's make it clear that the user needs to keep track of their ARN: Once you have created the IAM role, you should be able to get the ARN from the AWS console. You'll use the ARN in the next step.
| 2. An **Iceberg catalog connection** to interact with the Iceberg catalog | ||
|
|
||
| ### Create an AWS connection | ||
|
|
There was a problem hiding this comment.
Nit: Fill in the ARN from step 1
| SELECT external_id | ||
| FROM mz_internal.mz_aws_connections awsc | ||
| JOIN mz_connections c ON awsc.id = c.id | ||
| WHERE c.name = 'aws_connection'; |
There was a problem hiding this comment.
let's be more descriptive here:
Run the query below to fetch the external_id. Once you have the external_id, go back to the trust policy for the IAM role created in step 1. Add the external_id to the policy, in the field labeled sts:ExternalId. At the end of this step, your IAM trust policy should look like this:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::664411391173:role/MaterializeConnection"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "mz_5ef0f19e-3172-4b35-a7f8-19862d214677_u191"
}
}
}
]
}
| ); | ||
| ``` | ||
|
|
||
| Replace `<region>` with your AWS region (e.g., `us-east-1`) and `<table-bucket-name>` |
There was a problem hiding this comment.
We could also just tell the user to copy the ARN for their table bucket right?
| { | ||
| "Effect": "Allow", | ||
| "Principal": { | ||
| "AWS": "arn:aws:iam::664411391173:role/MaterializeConnection" |
There was a problem hiding this comment.
Just double checking - is this the right AWS account? Would this differ by region, i.e., for us-east-1 vs us-west?
| JOIN mz_connections c ON awsc.id = c.id | ||
| WHERE c.name = 'aws_connection'; | ||
| ``` | ||
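Once the query above returns the external ID, the `sts:ExternalId` placeholder in the trust policy from Step 1 has to be replaced. A minimal sketch of that edit in Python, assuming the trust policy is held as a dict with the `"PENDING"` placeholder; `set_external_id` is a hypothetical helper (not part of Materialize or AWS tooling), and the external ID is the example value quoted in this thread:

```python
import copy
import json


def set_external_id(trust_policy: dict, external_id: str) -> dict:
    """Return a copy of the policy with sts:ExternalId set on every
    sts:AssumeRole statement. Hypothetical helper, for illustration only."""
    updated = copy.deepcopy(trust_policy)
    for statement in updated["Statement"]:
        if statement.get("Action") == "sts:AssumeRole":
            condition = statement.setdefault("Condition", {})
            condition.setdefault("StringEquals", {})["sts:ExternalId"] = external_id
    return updated


# Trust policy as created in Step 1, still carrying the placeholder.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::664411391173:role/MaterializeConnection"
            },
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "PENDING"}},
        }
    ],
}

# Example value only; use the external_id returned by your own query.
updated_policy = set_external_id(
    trust_policy, "mz_5ef0f19e-3172-4b35-a7f8-19862d214677_u191"
)
print(json.dumps(updated_policy, indent=2))
```

The helper copies the policy rather than mutating it, so the original document with the placeholder stays available for comparison before the result is pasted back into the IAM console.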
|
|
There was a problem hiding this comment.
Let's add a step here to validate the connection
|
|
||
| ```mzsql | ||
| CREATE CONNECTION aws_connection | ||
| TO AWS (ASSUME ROLE ARN = 'arn:aws:iam::<account-id>:role/<role>'); |
There was a problem hiding this comment.
We should be explicit about setting the region / link out to our usual connection creation steps
| ``` | ||
|
|
||
| You'll update the external ID after creating the AWS connection in Materialize. | ||
|
|
There was a problem hiding this comment.
Missing: we need to tell the user to attach the permissions policy they created before
|
|
||
| - Ensure you have access to an AWS account with permissions to create and manage | ||
| IAM policies and roles. | ||
| - Ensure you have an AWS S3 Tables bucket configured in your AWS account. |
There was a problem hiding this comment.
Also:
- Ensure that you have created a namespace
| NAMESPACE = 'my_namespace', | ||
| TABLE = 'my_table' | ||
| ) | ||
| USING AWS CONNECTION aws_connection |
There was a problem hiding this comment.
missing: envelope upsert
| | `NAMESPACE` | The Iceberg namespace (database) containing the table. | | ||
| | `TABLE` | The name of the Iceberg table to write to. | | ||
| | `KEY` | **Required.** The columns that uniquely identify rows. Used to track updates and deletes. | | ||
| | `COMMIT INTERVAL` | **Required.** How frequently to commit snapshots to Iceberg. See [Commit interval tradeoffs](#commit-interval-tradeoffs) below. | |
There was a problem hiding this comment.
Also missing envelope here
There was a problem hiding this comment.
we should document what the min & max commit intervals are
| { | ||
| "Effect": "Allow", | ||
| "Principal": { | ||
| "AWS": "arn:aws:iam::664411391173:role/MaterializeConnection" |
There was a problem hiding this comment.
This might need to change for self-managed, since users wouldn't go through our cloud acct
There was a problem hiding this comment.
Also: how would this work for our emulator?
| |---------------------------------|-------------------------------| | ||
| | Lower latency - data visible sooner | Higher latency - data takes longer to appear | | ||
| | More small files - can degrade query performance | Fewer, larger files - better query performance | | ||
| | Higher catalog overhead | Lower catalog overhead | |
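To put rough numbers on the tradeoff in the table above, here is a small Python sketch that counts how many snapshots a given commit interval would produce per day. It assumes one snapshot is committed per interval, which this thread does not confirm; `parse_interval` and `commits_per_day` are hypothetical helpers, and the parser only handles simple strings such as `'10s'` and `'1m'`, not the full interval syntax.

```python
from datetime import timedelta

# Units understood by this sketch only; real interval strings can be richer.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_interval(text: str) -> timedelta:
    """Parse strings like '10s' or '1m' into a timedelta (hypothetical helper)."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not value.isdigit() or int(value) == 0:
        raise ValueError(f"unsupported interval: {text!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def commits_per_day(interval: str) -> int:
    """Snapshots per 24 hours, assuming one commit per interval."""
    return timedelta(days=1) // parse_interval(interval)


for interval in ("10s", "1m", "10m"):
    print(interval, commits_per_day(interval))
```

Under that assumption, `'10s'` yields 8640 commits per day and `'1m'` yields 1440, which is the difference in file count and catalog overhead that the table describes qualitatively.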
| - **Partitioning**: Materialize creates unpartitioned tables. Partitioned tables | ||
| are not yet supported. | ||
| - **Record types**: Composite/record types are not supported. Use scalar types | ||
| or flatten your data structure. |
There was a problem hiding this comment.
Let's add the limitation that users have to deliver data to the same region
1652830 to 769742c
769742c to 0a4339a
| results. | ||
| {{< /warning >}} | ||
|
|
||
| ## Limitations |
There was a problem hiding this comment.
Should we add a section around best practices, recommending that Materialize is the only writer to the iceberg table? And that users have only 1 sink to the iceberg table in question @DAlperin ?
| "s3:PutObject", | ||
| "s3:DeleteObject" | ||
| ], | ||
| "Resource": "arn:aws:s3:::<bucket>/<prefix>/*" |
There was a problem hiding this comment.
What should the be? Just the namespace? What if I have multiple namespaces (I assume it's one per sink)
There was a problem hiding this comment.
this will go away as this isn't needed. Only the following is needed.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3tables:*",
"Resource": "*"
}
]
}
| The AWS account ID `664411391173` is the Materialize AWS account. This may | ||
| differ for self-managed deployments. |
There was a problem hiding this comment.
Why do we say "may differ"? It's definitely different for customers I think.
| "Action": "sts:AssumeRole", | ||
| "Condition": { | ||
| "StringEquals": { | ||
| "sts:ExternalId": "PENDING" |
There was a problem hiding this comment.
What is the meaning of PENDING here? Edit: I see, I should have read further! (or we could explain earlier that it will be filled later)
There was a problem hiding this comment.
(have a local wip patch to this draft).
|
|
||
| ### Create an IAM role | ||
|
|
||
| Create an [IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) |
There was a problem hiding this comment.
What kind of role do I need? I see a bunch of trusted entity types to choose from.
There was a problem hiding this comment.
For this tutorial "Custom trust policy" (have a local wip patch to this draft).
| "Action": [ | ||
| "s3:ListBucket" | ||
| ], | ||
| "Resource": "arn:aws:s3:::<bucket>", |
There was a problem hiding this comment.
Should probably mention that you have to replace and
There was a problem hiding this comment.
Hey Dennis --
I have a local patch on my machine ... it's still a WIP, but this will be updated to:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3tables:*",
"Resource": "*"
}
]
}
That is, that statement allows all the specific actions.
There was a problem hiding this comment.
I pushed up my WIP patch ... I'm still working on it ... but figured current work can still help people as they test.
https://gist.github.com/DAlperin/0765693e5afbda68c7a0bb05f63e00eb
Motivation
Tips for reviewer
Checklist
$T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.