Iceberg full table corruption fix by thisisArjit · Pull Request #4201 · apache/gobblin

thisisArjit · 2026-06-16T08:58:46Z

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-XXX

Description

Here are some details about my PR, including screenshots (if applicable):

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

…ata on publish

Two related fixes for inter-cluster Iceberg full-table copy:
1. Determine which files to copy by diffing the source table against what the DESTINATION Iceberg catalog references, rather than probing each file's presence on the destination filesystem (config-gated, default on: iceberg.dataset.determine.copy.from.dest.catalog). Because the dest table is committed only after a fully successful publish, every path it references is guaranteed present and consistent, while orphan metadata left on the dest filesystem by a prior partially-failed run is NOT referenced, so its missing data files are still re-copied. This prevents the table corruption that filesystem-presence short-circuiting can cause (a metadata file present on the dest FS skips its whole subtree forever, leaving the committed table referencing data files that never get copied).
2. On publish, rename data files before metadata files (HadoopUtils.renameRecursivelyOrdered) so Iceberg metadata is never committed ahead of the data it references.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
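The two ideas in the commit message above can be illustrated with a minimal, self-contained Java sketch. It is not the code from this PR: the class `IcebergCopySketch`, its method names, and the rule that a path containing `/metadata/` is a metadata file are all assumptions made for illustration, and the real `HadoopUtils.renameRecursivelyOrdered` signature is not shown on this page, so it is not reproduced here. The sketch only shows (1) a set difference between the paths the source table references and the paths the destination catalog references, and (2) a stable ordering that places data files before metadata files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not Gobblin's actual implementation. */
final class IcebergCopySketch {

  private IcebergCopySketch() {}

  /**
   * Returns the source-referenced paths that the destination catalog does NOT
   * reference. Files that merely exist on the destination filesystem (for
   * example orphans from a failed run) are deliberately not consulted.
   */
  public static Set<String> pathsToCopy(Set<String> sourcePaths, Set<String> destCatalogPaths) {
    Set<String> missing = new LinkedHashSet<>(sourcePaths); // keep source iteration order
    missing.removeAll(destCatalogPaths);
    return missing;
  }

  /** Assumption: metadata files live under a "/metadata/" directory. */
  public static boolean isMetadata(String path) {
    return path.contains("/metadata/");
  }

  /**
   * Orders staged paths so every data file precedes every metadata file.
   * List.sort is stable, so relative order within each group is preserved.
   */
  public static List<String> publishOrder(List<String> stagedPaths) {
    List<String> ordered = new ArrayList<>(stagedPaths);
    // Boolean.FALSE sorts before Boolean.TRUE, so non-metadata paths come first.
    ordered.sort(Comparator.comparing(IcebergCopySketch::isMetadata));
    return ordered;
  }
}
```

With this sketch, a metadata file left behind on the destination filesystem by a failed run has no effect on `pathsToCopy`, because only the destination catalog's referenced paths are passed in; and renaming files in `publishOrder` sequence means a metadata file is never moved into place before the data files listed ahead of it.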

thisisArjit and others added 2 commits June 16, 2026 12:24

Propagate job properties to Work units

2b83227

thisisArjit wants to merge 2 commits into apache:master from thisisArjit:master
