Dataflow Template for Firestore to Firestore Documents Copy (Clone of PR 3098) #3392

Open
jingqizz wants to merge 3 commits into GoogleCloudPlatform:main from jingqizz:pr3098-noloadtest

@gemini-code-assist

Summary of Changes

Hello @jingqizz, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust Dataflow template designed to efficiently copy documents from one Firestore database to another. This new capability streamlines data migration and replication workflows for Firestore users. Accompanying this template are new utility classes for managing Firestore resources during testing, ensuring the reliability and maintainability of the new pipeline. The changes also include updated project configurations and detailed documentation to facilitate adoption and deployment.

Highlights

  • New Dataflow Template: Introduced a new Dataflow template named 'Firestore to Firestore' for batch copying documents between two Firestore databases.
  • Firestore Resource Management: Added new resource manager classes (FirestoreAdminResourceManager and FirestoreResourceManager) and associated exceptions for managing Firestore databases and collections within integration tests.
  • Beam Transforms for Firestore: Implemented custom Apache Beam PTransform and DoFn classes (CreatePartitionQueryRequestFn, PrepareWritesFn, RunQueryResponseToDocumentFn) to handle Firestore-specific operations like partitioning queries, preparing writes, and extracting documents.
  • Dependency Updates and Module Integration: Updated the it/google-cloud-platform/pom.xml to include necessary Firestore dependencies and added the new firestore-to-firestore module to the parent v2/pom.xml.
  • Documentation and Terraform Support: Provided comprehensive documentation (README_Cloud_Firestore_to_Firestore.md) for the new template, including usage instructions and a Terraform configuration (dataflow_job.tf) for deployment.
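
The document-name handling that PrepareWritesFn is described as performing can be illustrated with a minimal sketch. This is not the PR's implementation: the class and method names (`DocumentNames`, `toDestinationName`) are hypothetical, and the sketch only assumes the standard Firestore resource-name layout `projects/{project}/databases/{database}/documents/{path}`.

```java
/** Hypothetical helper (not from the PR) that re-targets a Firestore document name. */
final class DocumentNames {
  private static final String DOCUMENTS_SEGMENT = "/documents/";

  private DocumentNames() {}

  /**
   * Rebuilds a full document name under the destination project and database,
   * keeping the document path relative to the database root unchanged.
   */
  static String toDestinationName(
      String sourceName, String destProjectId, String destDatabaseId) {
    // The first "/documents/" marks the database root, even if a collection
    // further down the path is itself called "documents".
    int idx = sourceName.indexOf(DOCUMENTS_SEGMENT);
    if (idx < 0) {
      throw new IllegalArgumentException("Not a Firestore document name: " + sourceName);
    }
    String relativePath = sourceName.substring(idx + DOCUMENTS_SEGMENT.length());
    return "projects/" + destProjectId
        + "/databases/" + destDatabaseId
        + DOCUMENTS_SEGMENT + relativePath;
  }
}
```

Only the project and database segments change; the collection and document IDs are carried over as-is.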

Changelog

  • it/google-cloud-platform/pom.xml
    • Updated google-cloud-spanner dependency version to 6.102.0.
    • Added google-cloud-pubsublite dependency.
    • Added google-cloud-firestore and google-cloud-firestore-admin dependencies.
  • it/google-cloud-platform/src/main/java/org/apache/beam/it/gcp/firestore/FirestoreAdminResourceManager.java
    • Added new class FirestoreAdminResourceManager to manage Firestore database creation and deletion for integration tests.
  • it/google-cloud-platform/src/main/java/org/apache/beam/it/gcp/firestore/FirestoreAdminResourceManagerException.java
    • Added custom exception FirestoreAdminResourceManagerException for Firestore admin resource management operations.
  • it/google-cloud-platform/src/main/java/org/apache/beam/it/gcp/firestore/FirestoreResourceManager.java
    • Added new class FirestoreResourceManager to manage Firestore collection writes, reads, and cleanup for integration tests.
  • it/google-cloud-platform/src/main/java/org/apache/beam/it/gcp/firestore/FirestoreResourceManagerException.java
    • Added custom exception FirestoreResourceManagerException for Firestore resource management operations.
  • it/google-cloud-platform/src/main/java/org/apache/beam/it/gcp/firestore/package-info.java
    • Added package-level Javadoc for Firestore resource management classes.
  • v2/firestore-to-firestore/README_Cloud_Firestore_to_Firestore.md
    • Added documentation for the new Firestore to Firestore Dataflow template, including parameters, getting started guide, and Terraform usage.
  • v2/firestore-to-firestore/pom.xml
    • Added new Maven module firestore-to-firestore with dependencies for the Dataflow template.
  • v2/firestore-to-firestore/src/main/java/com/google/cloud/teleport/v2/templates/FirestoreToFirestore.java
    • Added the main class for the Firestore to Firestore Dataflow template, implementing the pipeline logic for document copying.
  • v2/firestore-to-firestore/src/main/java/com/google/cloud/teleport/v2/templates/package-info.java
    • Added package-level Javadoc for the Firestore to Firestore template.
  • v2/firestore-to-firestore/src/main/java/com/google/cloud/teleport/v2/transforms/CreatePartitionQueryRequestFn.java
    • Added a Beam PTransform to create Firestore PartitionQueryRequest objects from collection IDs.
  • v2/firestore-to-firestore/src/main/java/com/google/cloud/teleport/v2/transforms/PrepareWritesFn.java
    • Added a Beam DoFn to transform source Firestore Document objects into Write requests for the destination database.
  • v2/firestore-to-firestore/src/main/java/com/google/cloud/teleport/v2/transforms/RunQueryResponseToDocumentFn.java
    • Added a Beam DoFn to extract Document objects from Firestore RunQueryResponse messages.
  • v2/firestore-to-firestore/src/main/java/com/google/cloud/teleport/v2/transforms/package-info.java
    • Added package-level Javadoc for the Firestore-related Beam transforms.
  • v2/firestore-to-firestore/src/test/java/com/google/cloud/teleport/v2/transforms/CreatePartitionQueryRequestFnTest.java
    • Added unit tests for CreatePartitionQueryRequestFn to verify correct generation of partition queries.
  • v2/firestore-to-firestore/src/test/java/com/google/cloud/teleport/v2/transforms/PrepareWritesFnTest.java
    • Added unit tests for PrepareWritesFn to ensure correct document name transformation and field preservation.
  • v2/firestore-to-firestore/src/test/java/com/google/cloud/teleport/v2/transforms/RunQueryResponseToDocumentFnTest.java
    • Added unit tests for RunQueryResponseToDocumentFn to confirm proper document extraction from query responses.
  • v2/firestore-to-firestore/terraform/Cloud_Firestore_to_Firestore/dataflow_job.tf
    • Added Terraform configuration for deploying the Firestore to Firestore Dataflow Flex Template job.
  • v2/pom.xml
    • Added firestore-to-firestore as a new module to the parent pom.xml.

Activity

  • The pull request is a clone of an existing PR (3098), indicating a re-submission or continuation of previous work.
  • The author jingqizz has submitted these changes, with pacoavila808 being carbon-copied, suggesting prior involvement or interest from them.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@jingqizz jingqizz marked this pull request as draft February 20, 2026 17:19
@ninjaAB-5 ninjaAB-5 requested a review from a team February 20, 2026 19:09
@jingqizz jingqizz marked this pull request as ready for review February 20, 2026 21:32
ninjaAB-5 previously approved these changes Feb 24, 2026
p.apply(Create.of(collectionIdsList))
    .apply(
        new CreatePartitionQueryRequestFn(
            sourceProjectId, sourceDatabaseId, maxNumWorkers > 1 ? maxNumWorkers : 20L));

@pacoavila808 Feb 24, 2026

nit: We should probably store the default max workers in a constant, and include it in the documentation of the flag as well.

(update: looks like this is a shared flag, what's the default as is?)
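
A minimal sketch of what the named constant could look like, assuming the fallback value of 20 from the snippet above. The class, constant, and method names (`PartitionDefaults`, `DEFAULT_PARTITION_COUNT`, `resolvePartitionCount`) are hypothetical and do not come from the PR.

```java
/** Hypothetical holder for the fallback used when maxNumWorkers is unset or too small. */
final class PartitionDefaults {
  /** Assumed default, taken from the literal 20L in the reviewed snippet. */
  static final long DEFAULT_PARTITION_COUNT = 20L;

  private PartitionDefaults() {}

  /** Returns maxNumWorkers when it is greater than 1, otherwise the named default. */
  static long resolvePartitionCount(long maxNumWorkers) {
    return maxNumWorkers > 1 ? maxNumWorkers : DEFAULT_PARTITION_COUNT;
  }
}
```

The call site would then pass `PartitionDefaults.resolvePartitionCount(maxNumWorkers)` instead of the inline ternary, and the flag documentation could reference the same constant.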

  collectionIdsList = getAllCollectionIds(sourceProjectId, sourceDatabaseId);
} catch (Exception e) {
  LOG.error("Failed to list collections: {}", e.getMessage(), e);
  return;

I think we actually want to throw the exception here to make sure the pipeline clearly fails in case this happens.
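
One way to do this, sketched under assumptions: the `Callable` below stands in for `getAllCollectionIds(sourceProjectId, sourceDatabaseId)`, and the class and method names (`CollectionListing`, `listOrFail`) are hypothetical rather than taken from the PR.

```java
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical wrapper showing fail-fast behavior instead of log-and-return. */
final class CollectionListing {
  private CollectionListing() {}

  /**
   * Runs the lister and rethrows any failure so that the caller, and therefore
   * the pipeline launch, fails visibly instead of exiting silently.
   */
  static List<String> listOrFail(Callable<List<String>> lister) {
    try {
      return lister.call();
    } catch (Exception e) {
      // Keep the original exception as the cause so the stack trace is preserved.
      throw new RuntimeException("Failed to list collections: " + e.getMessage(), e);
    }
  }
}
```

Wrapping in an unchecked exception keeps the calling method's signature unchanged; declaring a checked exception on the caller would work as well.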

description = {
  "The Firestore to Firestore template is a batch pipeline that reads documents from one"
      + " <a href=\"https://cloud.google.com/firestore/docs\">Firestore</a> database and writes"
      + " them to another Firestore database.",

Can you add a callout that this does not support Enterprise edition databases as the source?

.addParameter("sourceDatabaseId", SOURCE_DATABASE_ID)
.addParameter("destinationProjectId", PROJECT)
.addParameter("destinationDatabaseId", DESTINATION_DATABASE_ID)
.addParameter("collectionIds", collectionId)

Can you add a test case that doesn't specify collectionIds, and verify that all collections are copied?

  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-spanner</artifactId>
- <version>6.104.0</version>
+ <version>6.102.0</version>

Is this needed? Ideally we wouldn't downgrade the spanner version here as this looks like a shared dependency.

(sorry can't remember if this was part of my original changes and what the context was).

<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-pubsublite</artifactId>
</dependency>

Similarly here, I think this is unrelated to our change. Can you remove it and verify it's not needed?

validateOptions(options);
LOG.info("Pipeline options parsed and validated.");

Pipeline p = Pipeline.create(options);

Can you add a pipeline option to log errors to a given GCS path? See https://docs.cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-to-firestore for an example.
