
Iceberg full table corruption fix #4201

Draft
thisisArjit wants to merge 2 commits into apache:master from thisisArjit:master

Conversation

@thisisArjit

Contributor

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

thisisArjit and others added 2 commits June 16, 2026 12:24
…ata on publish

Two related fixes for inter-cluster Iceberg full-table copy:

1. Determine which files to copy by diffing the source table against what the
   DESTINATION Iceberg catalog references, rather than probing each file's
   presence on the destination filesystem (config-gated, default on:
   iceberg.dataset.determine.copy.from.dest.catalog). Because the dest table is
   committed only after a fully successful publish, every path it references is
   guaranteed present and consistent, while orphan metadata left on the dest
   filesystem by a prior partially-failed run is NOT referenced -- so its
   missing data files are still re-copied. This prevents the table corruption
   that filesystem-presence short-circuiting can cause (a metadata file present
   on the dest FS skips its whole subtree forever, leaving the committed table
   referencing data files that never get copied).

2. On publish, rename data files before metadata files
   (HadoopUtils.renameRecursivelyOrdered) so Iceberg metadata is never
   committed ahead of the data it references.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
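
The two fixes described in the commit message can be illustrated with a minimal sketch. It is not the code in this PR: the class `CopyPlanSketch`, its methods, and the `/metadata/` path convention used to classify files are all hypothetical, and plain strings stand in for Iceberg catalog references and Hadoop paths. It only shows the idea of (1) copying whatever the source table references that the committed destination table does not, regardless of what happens to exist on the destination filesystem, and (2) ordering the publish so data files are moved before metadata files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; names and path conventions are assumptions.
final class CopyPlanSketch {

  /**
   * Fix 1: files to copy = paths referenced by the source table minus paths
   * referenced by the committed destination table. Files that merely exist on
   * the destination filesystem (e.g. orphans from a failed run) are ignored.
   */
  static Set<String> pathsToCopy(Set<String> sourceRefs, Set<String> destCatalogRefs) {
    Set<String> toCopy = new LinkedHashSet<>(sourceRefs); // keeps source order
    toCopy.removeAll(destCatalogRefs);
    return toCopy;
  }

  /** Assumed classifier: anything under a "/metadata/" directory is metadata. */
  static boolean isMetadata(String path) {
    return path.contains("/metadata/");
  }

  /**
   * Fix 2: order the publish so every data file precedes every metadata file.
   * List.sort is stable, so relative order within each group is preserved.
   */
  static List<String> publishOrder(Set<String> paths) {
    List<String> ordered = new ArrayList<>(paths);
    // Boolean.FALSE (data) sorts before Boolean.TRUE (metadata).
    ordered.sort(Comparator.comparing((String p) -> isMetadata(p)));
    return ordered;
  }

  public static void main(String[] args) {
    Set<String> sourceRefs = new LinkedHashSet<>(List.of(
        "/t/metadata/v2.metadata.json",
        "/t/data/a.parquet",
        "/t/data/b.parquet",
        "/t/metadata/snap-2.avro"));

    // The committed destination table references only a.parquet. An orphan
    // copy of snap-2.avro may sit on the destination filesystem, but since
    // the catalog does not reference it, it does not short-circuit the copy.
    Set<String> destCatalogRefs = Set.of("/t/data/a.parquet");

    Set<String> toCopy = pathsToCopy(sourceRefs, destCatalogRefs);
    System.out.println("to copy:       " + toCopy);
    System.out.println("publish order: " + publishOrder(toCopy));
  }
}
```

In this sketch, `b.parquet` is still scheduled for copy even if metadata that references it was left behind on the destination by an earlier partial run, which is the corruption scenario the commit message describes. The actual change works against the destination Iceberg catalog and `HadoopUtils.renameRecursivelyOrdered`, whose signatures are not shown on this page.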
