
feat(orch): debug a sandbox guest kernel with resume-build -gdb #3040

Open
kalyazin wants to merge 5 commits into main from kalyazin/resume-build-gdb

@kalyazin

Contributor

Why

Host- and UFFD-side telemetry can't show what happens inside a resumed guest — which process/VMA is faulting, kernel state during the resume. This adds a way to attach gdb to a resumed sandbox guest with source-level kernel symbols, for diagnosing resume behaviour (page-fault attribution, scheduler/VM state) on a dev node.

What

resume-build -gdb resumes a snapshot under a --features gdb Firecracker held at the kernel entry breakpoint, loads the guest kernel's DWARF symbols, and hands over a ready gdb session (interactive, or scripted via -gdb-exec / -gdb-script). Commit-by-commit:

  • feat(orch): allow skipping the envd readiness wait on resume: Config.SkipEnvdWait gates the post-resume WaitForEnvd in ResumeSandbox (the guest is held at the breakpoint and never boots envd).
  • feat(orch): add fc-debug.gdb guest-kernel debugging macros — reusable gdb macros: fc-faults [N] (attributes guest page faults to comm/pid/VMA), fc-curr, fc-task, fc-regions, fc-va. Targets Linux 6.1.x x86_64.
  • feat(orch): add resume-build -gdb … — the orchestration: arms FIRECRACKER_GDB_SOCKET, stages the debug FC binary, resumes in the background and connects gdb once FC binds the socket (the stub holds the snapshot load open until a debugger attaches — resuming first would deadlock), generates the init script, prints a debug-context block, drives gdb. Debug artifacts (firecracker-debug, vmlinux.debug) are fetched by version from e2b-prod-public-builds (override base via E2B_GDB_ARTIFACTS_URL; override paths via -gdb-fc / -gdb-symbols). Symbols load at offset 0 — FC boots the uncompressed vmlinux ELF directly, so image KASLR never runs and there's no slide to recover.
  • docs(orch): add … runbook: gdb-debugging.md (usage, macros, observer-effect notes, customer-data rules).
  • feat(orch): honor PROXY_PORT in create-build — lets create-build run alongside a live orchestrator on an alternate port (needed to build the test snapshot on a node).

Depends on the fc-versions and fc-kernels release pipelines publishing firecracker-debug-<arch> and vmlinux.debug to the public builds bucket. Until those land, the fetch returns 404, so you have to build the artifacts yourself and pass -gdb-fc / -gdb-symbols (or set E2B_GDB_ARTIFACTS_URL). A debug-only --features gdb Firecracker build must never be promoted to a prod version.

Validation

  • End-to-end on the dev cluster: built a DWARF kernel + --features gdb FC on the cluster, create-build cold-booted the DWARF kernel into a snapshot, then resume-build -gdb (no flags) fetched both artifacts by version, recovered a ready session, and produced real fault attribution:
    • FAULT comm=systemd-journal pid=283 addr=0x7f83… vma=0x7f83…-… flags=0xfb
    • fc-curr: comm=systemd-journal pid=283 …; fc-regions: page_offset_base=0xffff888000000000 …
  • Unit tests: debugArtifactsBaseURL, resolveOrFetch (override→local→fetch precedence, 404), proxyPort parsing/fallback.
  • golangci-lint clean; every commit builds and passes tests on its own.

🤖 Generated with Claude Code

kalyazin and others added 2 commits June 17, 2026 11:47
Add Config.SkipEnvdWait, gating the post-resume WaitForEnvd in ResumeSandbox.
The resume-build gdb debugging flow needs it: the guest is held at a gdb entry
breakpoint and never boots envd, so the readiness wait would time out and tear
the sandbox down before a debugger can attach.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Reusable gdb macros (promoted from the restore-path PoC) loaded by
resume-build -gdb: fc-faults [N] attributes guest page faults to comm/pid/VMA,
plus fc-task, fc-curr, fc-regions, fc-va. Targets Linux 6.1.x x86_64.

Sample output (base template resumed under gdb on a dev node):

  fc-faults 3:
    FAULT comm=systemd-journal pid=283 addr=0x7f833ca73ea8 vma=0x7f833ca27000-0x7f833cb27000 flags=0xfb
    FAULT comm=systemd-journal pid=283 addr=0x7f833e0c7088 vma=0x7f833e0c7000-0x7f833e0c8000 flags=0xfb
    FAULT comm=systemd-journal pid=283 addr=0x7f833ca56488 vma=0x7f833ca27000-0x7f833cb27000 flags=0xfb

  fc-curr 0 / fc-curr 1 (per-vCPU current task):
    task=0xffffffff82419600 comm=swapper/0 pid=0 tgid=0 mm=0x0
    task=0xffff888003dad800 comm=systemd-journal pid=283 tgid=283 mm=0xffff888002c62a80

  fc-regions / fc-va 0x100000:
    page_offset_base = 0xffff888000000000
    vmemmap_base     = 0xffffea0000000000
    vmalloc_base     = 0xffffc90000000000
    __va(0x100000) = 0xffff888000100000

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
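
As a cross-check of the sample output, the arithmetic behind fc-va and the VMA attribution can be reproduced outside gdb. This is a small sketch, not the macro implementation: directMapVA and the vma type are illustrative names, and the constants are copied from the sample above. On x86_64 the direct map is linear, so __va(p) is p plus page_offset_base.

```go
package main

import "fmt"

// pageOffsetBase is the value printed by fc-regions in the sample output.
const pageOffsetBase uint64 = 0xffff888000000000

// directMapVA mirrors the kernel's __va() for the x86_64 direct map:
// a physical address is offset linearly by page_offset_base.
func directMapVA(phys uint64) uint64 {
	return pageOffsetBase + phys
}

// vma is a half-open address range [start, end), as in the fc-faults output.
type vma struct {
	start, end uint64
}

// contains reports whether addr falls inside the VMA.
func (v vma) contains(addr uint64) bool {
	return addr >= v.start && addr < v.end
}

func main() {
	fmt.Printf("__va(0x100000) = %#x\n", directMapVA(0x100000))

	// First and second faulting addresses from the fc-faults sample,
	// checked against the first reported VMA.
	v := vma{start: 0x7f833ca27000, end: 0x7f833cb27000}
	fmt.Println(v.contains(0x7f833ca73ea8), v.contains(0x7f833e0c7088))
}
```

The first address lies inside the 0x7f833ca27000-0x7f833cb27000 mapping and the second does not, which matches the two different VMAs shown in the sample.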
@cla-bot cla-bot Bot added the cla-signed label Jun 18, 2026
@cursor

cursor Bot commented Jun 18, 2026


PR Summary

Medium Risk
ResumeSandbox behavior changes when SkipEnvdWait is set (sandbox marked running without envd), and gdb mode temporarily replaces the prod Firecracker binary on disk; both are dev-tooling paths but the resume hook is shared production code.

Overview
Adds resume-build -gdb so a snapshot can be resumed under a gdb-enabled Firecracker held at the kernel entry breakpoint, with guest vmlinux.debug symbols loaded and an interactive or batch gdb session (-gdb-exec / -gdb-script). The flow resolves firecracker-debug and vmlinux.debug by FC/kernel version (overridable via E2B_GDB_ARTIFACTS_URL or -gdb-fc / -gdb-symbols), stages the debug FC binary with restore-on-exit, arms FIRECRACKER_GDB_SOCKET, resumes in the background and attaches gdb once the stub socket appears to avoid deadlock, and rejects combining -gdb with pause/cmd/bench/shell modes.

ResumeSandbox gains Config.SkipEnvdWait so the post-resume envd readiness wait is skipped when the guest never boots envd under gdb. Checked-in fc-debug.gdb macros and gdb-debugging.md document fault attribution and dev-only customer-data constraints.

create-build reads PROXY_PORT (default 5007) so local builds can share a node with a running orchestrator proxy.

Reviewed by Cursor Bugbot for commit 89df3d2. Bugbot is set up for automated code reviews on this repo.

@codecov

codecov Bot commented Jun 18, 2026


Codecov Report

❌ Patch coverage is 10.93333% with 334 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| packages/orchestrator/cmd/resume-build/gdb.go | 10.39% | 285 Missing and 8 partials ⚠️ |
| packages/orchestrator/cmd/resume-build/main.go | 0.00% | 26 Missing ⚠️ |
| packages/orchestrator/pkg/sandbox/sandbox.go | 0.00% | 14 Missing ⚠️ |
| packages/orchestrator/cmd/create-build/main.go | 87.50% | 1 Missing ⚠️ |


@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

Using time.After inside a select loop in waitForSocket creates a new timer on every iteration, which can cause a significant resource leak of timers; replacing it with a reused time.Ticker avoids this issue. Additionally, the write operation to the temporary GDB init script file in writeInitScript does not check for errors, which can lead to silent failures or truncated scripts if the disk is full or write permissions are restricted.


Comment on lines +366 to +381
func waitForSocket(ctx context.Context, path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if info, err := os.Stat(path); err == nil && info.Mode()&os.ModeSocket != 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timeout")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(50 * time.Millisecond):
		}
	}
}

Contributor

high

Using time.After inside a select loop creates a new timer on every iteration, which can cause a significant resource leak of timers if GDB takes a while to connect or if the context is canceled. Replacing this with a reused time.Ticker avoids leaking timers and ensures efficient resource usage.

Suggested change
func waitForSocket(ctx context.Context, path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		if info, err := os.Stat(path); err == nil && info.Mode()&os.ModeSocket != 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timeout")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

Contributor Author

Good catch — switched waitForSocket to a single reused time.Ticker (created once, defer ticker.Stop()) instead of allocating via time.After each iteration. Fixed in 2e27ece.

Comment on lines +272 to +292
func writeInitScript(symbols, socket string) (string, error) {
	macroLib, err := macroLibPath()
	if err != nil {
		return "", err
	}
	f, err := os.CreateTemp("", "fc-debug-init-*.gdb")
	if err != nil {
		return "", err
	}
	defer f.Close()
	fmt.Fprintf(f, `set pagination off
set confirm off
source %s
# FC boots the uncompressed vmlinux ELF directly, so KASLR never relocates the image:
# symbols sit at their link-time addresses (offset 0).
add-symbol-file %s -o 0x0
target remote %s
`, macroLib, symbols, socket)

	return f.Name(), nil
}

Contributor

high

The write operation to the temporary GDB init script file does not check for errors, which can lead to silent failures or truncated scripts if the disk is full or write permissions are restricted. Explicitly checking the write error and closing the file descriptor before returning ensures robust error handling.

func writeInitScript(symbols, socket string) (string, error) {
	macroLib, err := macroLibPath()
	if err != nil {
		return "", err
	}
	f, err := os.CreateTemp("", "fc-debug-init-*.gdb")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "set pagination off\nset confirm off\nsource %s\n# FC boots the uncompressed vmlinux ELF directly, so KASLR never relocates the image:\n# symbols sit at their link-time addresses (offset 0).\nadd-symbol-file %s -o 0x0\ntarget remote %s\n", macroLib, symbols, socket)
	if err != nil {
		return "", err
	}
	if err := f.Close(); err != nil {
		return "", err
	}

	return f.Name(), nil
}

Contributor Author

Fixed in 2e27ece: writeInitScript now checks the Fprintf write error and the explicit Close error, and removes the temp file on the error path (so a failed write does not leave a truncated init script behind).

Comment thread packages/orchestrator/cmd/resume-build/gdb.go
@kalyazin kalyazin force-pushed the kalyazin/resume-build-gdb branch 2 times, most recently from c2dedd5 to 8ab8d03 Compare June 18, 2026 15:07

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 8ab8d03.

Comment thread packages/orchestrator/cmd/resume-build/gdb.go
kalyazin and others added 3 commits June 18, 2026 16:15
Resume a snapshot under a gdb-enabled Firecracker held at the kernel entry
breakpoint and hand over a ready, symbol-resolving gdb session. Sets
FIRECRACKER_GDB_SOCKET (inherited by the no-jailer FC), skips the envd wait,
stages the debug FC binary at the resolved path (restored on exit), resumes in
the background and connects gdb once FC binds the socket (the stub holds the
snapshot load open until a debugger attaches, so resuming first would deadlock),
generates a parameterized init script sourcing fc-debug.gdb, prints a
debug-context block, and runs gdb interactively or in batch
(-gdb-exec/-gdb-script). One session per invocation; FC/UFFD/NBD torn down on
exit.

The debug artifacts (firecracker-debug, vmlinux.debug) are fetched by version
from the release buckets the same way create-build fetches the prod kernel/FC
(base overridable via E2B_GDB_ARTIFACTS_URL); -gdb-fc / -gdb-symbols override a
fetch with a local build.

Symbols load at their link-time addresses (offset 0): Firecracker boots the
uncompressed vmlinux ELF directly, so the bzImage KASLR decompressor never runs
and the kernel image is never relocated — there is no slide to recover. This
holds even with CONFIG_RANDOMIZE_BASE / CONFIG_RANDOMIZE_MEMORY enabled, since
both are gated on a boot-params flag that only the decompressor sets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
How to debug a sandbox's guest kernel with resume-build -gdb: prerequisites and
artifact staging, steps (interactive + scripted), the macro reference, the
observer-effect guidance (HW breakpoints, resident set from the UFFD log), and
the binding customer-data rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
create-build hardcoded the sandbox proxy port to 5007, so it could not
run on a node that already has a live orchestrator bound there. Read
PROXY_PORT (default 5007) so create-build can run alongside a live
orchestrator on an alternate port, as the gdb e2e does.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
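
A minimal sketch of the PROXY_PORT handling described here, assuming that an unset value falls back to 5007 and that a malformed value is rejected rather than silently ignored. The parseProxyPort name and the reject-on-invalid policy are assumptions; the PR's proxyPort helper may behave differently.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultProxyPort is the previously hardcoded port named in the commit message.
const defaultProxyPort uint16 = 5007

// parseProxyPort returns the default for an empty value and an error for
// anything that is not a valid, non-zero TCP port (assumed policy).
func parseProxyPort(raw string) (uint16, error) {
	if raw == "" {
		return defaultProxyPort, nil
	}
	// bitSize 16 makes ParseUint reject values above 65535.
	n, err := strconv.ParseUint(raw, 10, 16)
	if err != nil || n == 0 {
		return 0, fmt.Errorf("invalid PROXY_PORT %q", raw)
	}
	return uint16(n), nil
}

func main() {
	for _, raw := range []string{"", "5008", "70000", "abc"} {
		port, err := parseProxyPort(raw)
		fmt.Printf("%q -> port=%d err=%v\n", raw, port, err)
	}
}
```

In create-build the input would come from os.Getenv("PROXY_PORT"); taking the raw string as a parameter keeps the parsing testable without touching the process environment.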
@kalyazin kalyazin force-pushed the kalyazin/resume-build-gdb branch from 8ab8d03 to 89df3d2 Compare June 18, 2026 15:17
@kalyazin kalyazin marked this pull request as ready for review June 18, 2026 15:56