GOBBLIN-2265: AM waits for workflow terminal state before exit#4200
Merged
Blazer-007 merged 1 commit intoJun 16, 2026
Conversation
f1e97dd to
1b7dfd5
Compare
Blazer-007
approved these changes
Jun 16, 2026
…fix spurious CANCELLED on long jobs) The temporal AM runs its single job synchronously inside a Guava service startUp(), so the service stays STARTING for the whole job and ServiceBasedAppLauncher#start() returns once app.start.waitForServicesTimeout (the service-start "healthy" timeout, e.g. 300s) elapses even while the workflow is still running. main() then computed the exit code and System.exit'd immediately, firing the JVM shutdown hook (-> cancelJob -> WORKFLOW_EXECUTION_STATUS_CANCELED) and un-registering the not-yet-terminal status, so a successful job longer than that timeout was reported KILLED -> JobCancelTimer -> CANCELLED end-to-end. Fix: main() now waits for the application to actually STOP via ServiceBasedAppLauncher#awaitStopped() -- every managed service, including YarnService, reaching a terminal state -- before computing the exit code and exiting. Because YarnService is one of those services, this returns only after the un-register has completed (no race with close()). The application reaches "stopped" only once the workflow finishes and the shutdown it triggers (ClusterManagerShutdownRequest -> stop()) completes; the job runs on a non-daemon thread so the wait cannot deadlock. The wait is bounded by the flow SLA (gobblin.flow.sla.time), which the GaaS control plane cancels on overrun, thereby unblocking the wait. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1b7dfd5 to
cefef2d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a regression where a successful temporal-on-YARN job that runs longer than the AM's service-start timeout (
app.start.waitForServicesTimeout.seconds, e.g. 300s) is reported CANCELLED end-to-end. Follow-up to #4197 (same JIRA, GOBBLIN-2265).Problem
The temporal AM runs its single job synchronously inside a Guava service's
startUp():FsJobConfigurationManagerposts the job-config event on the EventBus, whichGobblinTemporalJobSchedulerhandles by callinglaunchJob(), blocking until the workflow finishes. The service therefore staysSTARTINGfor the whole job, soServiceBasedAppLauncher#start()returns onceapp.start.waitForServicesTimeoutelapses ("Timeout of N seconds exceeded ... Proceeding anyway") even while the workflow is still running.GobblinTemporalApplicationMaster.main()then immediately computed the exit code andSystem.exit()ed, which:GobblinTemporalJobScheduler.handleNewJobConfigArrival, callinglauncher.cancelJob()→ cancels the in-flight Temporal workflow (WORKFLOW_EXECUTION_STATUS_CANCELED), andFinalApplicationStatus.KILLED(derived from the cancelled status).The launcher mapped
KILLED → JobCancelTimer → ExecutionStatus.CANCELLED, so any job that outran the service-start timeout was force-cancelled and reported CANCELLED even though it would have completed successfully. (close()likewise un-registered the prematurenull → FAILEDstatus.) Before #4197 the AM un-registered a hardcodedSUCCEEDED, which masked this; #4197 surfaced the true status and thereby exposed the pre-existing premature teardown.Confirmed on prod from AM logs: the
ServiceBasedAppLauncher"Timeout of 300 seconds exceeded"line fires exactly at start+300s, immediately followed byCancelling temporal workflowandDerived FinalApplicationStatus KILLED. A sibling job that finished in ~3.5 min (under 300s) capturedCOMPLETEDand un-registeredSUCCEEDED.Changes
GobblinTemporalApplicationMaster.main()now waits for the workflow to actually reach a terminal state (GobblinTemporalJobLauncher#awaitTerminalStatus) before the AM closes / un-registers / exits. The wait sits inside the try-with-resources, soclose()/un-register run only once the captured status is real. The job runs on a non-daemon thread, so the wait cannot deadlock.GobblinTemporalJobLauncher#awaitTerminalStatus(maxWaitMillis, pollMillis)— polls the process-wide terminal-status cache.gobblin.temporal.am.workflow.completion.wait.timeout.minutes(default 1440) backstops a genuinely wedged workflow. The real bound on job runtime remains the GaaS flow SLA (gobblin.flow.sla.time), which cancels overruns and thereby unblocks the wait.Testing
awaitTerminalStatustests (returns immediately when already captured; waits for a late-arriving terminal status; times out to null when never terminal) + existingGobblinTemporalJobLauncherTestall pass.gobblin-temporal-workersand acarboncopy flow on prod-ltx1 — results appended below.