GOBBLIN-2265: AM waits for workflow terminal state before exit #4200

Merged
Blazer-007 merged 1 commit into apache:master from pratapaditya04:pratapaditya04/gobblin-2265-am-await-workflow-completion
Jun 16, 2026

Conversation

@pratapaditya04

Contributor

Summary

Fixes a regression where a successful temporal-on-YARN job that runs longer than the AM's service-start timeout (app.start.waitForServicesTimeout.seconds, e.g. 300s) is reported CANCELLED end-to-end. Follow-up to #4197 (same JIRA, GOBBLIN-2265).

Problem

The temporal AM runs its single job synchronously inside a Guava service's startUp(): FsJobConfigurationManager posts the job-config event on the EventBus, which GobblinTemporalJobScheduler handles by calling launchJob(), blocking until the workflow finishes. The service therefore stays STARTING for the whole job, so ServiceBasedAppLauncher#start() returns once app.start.waitForServicesTimeout elapses ("Timeout of N seconds exceeded ... Proceeding anyway") even while the workflow is still running.
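The timing interaction above can be reproduced in miniature with plain `java.util.concurrent`. This is only a sketch under stated assumptions: `StartTimeoutSketch` and `startedWithinTimeout` are hypothetical names, and the single-thread executor stands in for the Guava service whose startUp() blocks for the whole job. It is not the ServiceBasedAppLauncher code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical simulation; none of these names exist in Gobblin. */
final class StartTimeoutSketch {

  /**
   * Simulates a launcher waiting for a service whose startUp() blocks until the job ends.
   * Returns true if startUp() finished within the timeout, false if the launcher would
   * "proceed anyway" while the job is still running.
   */
  static boolean startedWithinTimeout(long jobMillis, long startTimeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      // Stand-in for startUp(): it does not return until the whole job has run.
      Future<Void> startUp = executor.submit(() -> {
        Thread.sleep(jobMillis);
        return null;
      });
      // Stand-in for the bounded service-start wait.
      startUp.get(startTimeoutMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      // The job outlasted the start timeout; the caller moves on regardless.
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      // Only for cleanup in this sketch; interrupts the simulated job if still running.
      executor.shutdownNow();
    }
  }
}
```

A job shorter than the timeout returns true; a longer one returns false, which corresponds to the case where main() went on to compute an exit code for a workflow that had not finished.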

GobblinTemporalApplicationMaster.main() then immediately computed the exit code and called System.exit(), which:

  1. fired the JVM shutdown hook registered in GobblinTemporalJobScheduler.handleNewJobConfigArrival, calling launcher.cancelJob() → cancels the in-flight Temporal workflow (WORKFLOW_EXECUTION_STATUS_CANCELED), and
  2. un-registered with FinalApplicationStatus.KILLED (derived from the cancelled status).

The launcher mapped KILLED → JobCancelTimer → ExecutionStatus.CANCELLED, so any job that outran the service-start timeout was force-cancelled and reported CANCELLED even though it would have completed successfully. (close() likewise un-registered the premature null → FAILED status.) Before #4197 the AM un-registered a hardcoded SUCCEEDED, which masked this; #4197 surfaced the true status and thereby exposed the pre-existing premature teardown.

Confirmed on prod from AM logs: the ServiceBasedAppLauncher "Timeout of 300 seconds exceeded" line fires exactly at start+300s, immediately followed by Cancelling temporal workflow and Derived FinalApplicationStatus KILLED. A sibling job that finished in ~3.5 min (under 300s) captured COMPLETED and un-registered SUCCEEDED.
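The status derivation described in this section can be summarised as a small mapping. The sketch below is illustrative only: `FinalStatusSketch`, `derive`, and both enums are hypothetical stand-ins rather than the Temporal or YARN types, and only the three cases named above (COMPLETED, cancelled, and a not-yet-captured null) come from this description. The default branch is an assumption.

```java
/** Hypothetical stand-in types; not the Temporal or YARN enums. */
final class FinalStatusSketch {

  enum WorkflowStatus { COMPLETED, CANCELED, FAILED }

  enum FinalApplicationStatus { SUCCEEDED, KILLED, FAILED }

  /** Derives the status the AM un-registers with from the captured workflow status. */
  static FinalApplicationStatus derive(WorkflowStatus captured) {
    if (captured == null) {
      // No terminal status captured yet: reported as FAILED, per the description.
      return FinalApplicationStatus.FAILED;
    }
    switch (captured) {
      case COMPLETED:
        return FinalApplicationStatus.SUCCEEDED;
      case CANCELED:
        // The launcher then maps KILLED to ExecutionStatus.CANCELLED.
        return FinalApplicationStatus.KILLED;
      default:
        // Assumption: any other terminal status is treated as a failure.
        return FinalApplicationStatus.FAILED;
    }
  }
}
```

With this mapping, a premature exit is indistinguishable from a real cancellation: the shutdown hook cancels the workflow first, so the derived status is KILLED even for a job that was on track to succeed.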

Changes

  • GobblinTemporalApplicationMaster.main() now waits for the workflow to actually reach a terminal state (GobblinTemporalJobLauncher#awaitTerminalStatus) before the AM closes / un-registers / exits. The wait sits inside the try-with-resources, so close()/un-register run only once the captured status is real. The job runs on a non-daemon thread, so the wait cannot deadlock.
  • GobblinTemporalJobLauncher#awaitTerminalStatus(maxWaitMillis, pollMillis) polls the process-wide terminal-status cache and returns null if no terminal status arrives within maxWaitMillis.
  • A single generous, configurable safety cap, gobblin.temporal.am.workflow.completion.wait.timeout.minutes (default 1440, i.e. 24 hours), backstops a genuinely wedged workflow. The real bound on job runtime remains the GaaS flow SLA (gobblin.flow.sla.time), which cancels overruns and thereby unblocks the wait.
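A bounded poll of the kind described in the bullets above might look roughly like the following. This is a minimal sketch, not the merged implementation: `TerminalStatusSketch`, `recordTerminalStatus`, and `reset` are hypothetical, the cache is assumed to be a single `AtomicReference` holding a status name, and only the `awaitTerminalStatus(maxWaitMillis, pollMillis)` shape and the null-on-timeout behaviour are taken from this description.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for the process-wide terminal-status cache. */
final class TerminalStatusSketch {

  // Assumption: one job per AM process, so a single slot is enough.
  private static final AtomicReference<String> TERMINAL_STATUS = new AtomicReference<>();

  /** Called by the job thread once the workflow reaches a terminal state. */
  static void recordTerminalStatus(String status) {
    TERMINAL_STATUS.set(status);
  }

  /** Clears the slot; only needed for repeated runs of this sketch. */
  static void reset() {
    TERMINAL_STATUS.set(null);
  }

  /** Polls until a terminal status is captured; returns null on timeout or interrupt. */
  static String awaitTerminalStatus(long maxWaitMillis, long pollMillis) {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
    while (true) {
      String status = TERMINAL_STATUS.get();
      if (status != null) {
        return status;
      }
      long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
      if (remainingMillis <= 0) {
        return null; // safety cap reached without a terminal status
      }
      try {
        // Sleep at least 1 ms and never past the deadline.
        Thread.sleep(Math.max(1L, Math.min(pollMillis, remainingMillis)));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return null;
      }
    }
  }
}
```

Under these assumptions the caller would pass the configured cap converted to milliseconds, for example `TimeUnit.MINUTES.toMillis(1440)` for the default, and treat a null return as "not terminal" rather than as success.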

Testing

  • Unit: new awaitTerminalStatus tests (returns immediately when already captured; waits for a late-arriving terminal status; times out to null when never terminal) + existing GobblinTemporalJobLauncherTest all pass.
  • E2E: validated via a snapshot gobblin build wired into gobblin-temporal-workers and a carbon copy flow on prod-ltx1 — results appended below.

@pratapaditya04 pratapaditya04 force-pushed the pratapaditya04/gobblin-2265-am-await-workflow-completion branch 2 times, most recently from f1e97dd to 1b7dfd5 Compare June 15, 2026 17:28
@pratapaditya04 pratapaditya04 changed the title from "GOBBLIN-2265: AM waits for workflow terminal state before exit (fix spurious CANCELLED for long-running temporal jobs)" to "GOBBLIN-2265: AM waits for workflow terminal state before exit" Jun 16, 2026
…fix spurious CANCELLED on long jobs)

The temporal AM runs its single job synchronously inside a Guava service startUp(),
so the service stays STARTING for the whole job and ServiceBasedAppLauncher#start()
returns once app.start.waitForServicesTimeout (the service-start "healthy" timeout,
e.g. 300s) elapses even while the workflow is still running. main() then computed the
exit code and System.exit'd immediately, firing the JVM shutdown hook (-> cancelJob ->
WORKFLOW_EXECUTION_STATUS_CANCELED) and un-registering the not-yet-terminal status, so
a successful job longer than that timeout was reported KILLED -> JobCancelTimer ->
CANCELLED end-to-end.

Fix: main() now waits for the application to actually STOP via
ServiceBasedAppLauncher#awaitStopped() -- every managed service, including YarnService,
reaching a terminal state -- before computing the exit code and exiting. Because
YarnService is one of those services, this returns only after the un-register has
completed (no race with close()). The application reaches "stopped" only once the
workflow finishes and the shutdown it triggers (ClusterManagerShutdownRequest ->
stop()) completes; the job runs on a non-daemon thread so the wait cannot deadlock. The
wait is bounded by the flow SLA (gobblin.flow.sla.time), which the GaaS control plane
cancels on overrun, thereby unblocking the wait.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pratapaditya04 pratapaditya04 force-pushed the pratapaditya04/gobblin-2265-am-await-workflow-completion branch from 1b7dfd5 to cefef2d Compare June 16, 2026 03:19
@Blazer-007 Blazer-007 merged commit 0091997 into apache:master Jun 16, 2026
6 checks passed

2 participants