feat(orch): debug a sandbox guest kernel with resume-build -gdb #3040
kalyazin wants to merge 5 commits
Add Config.SkipEnvdWait, gating the post-resume WaitForEnvd in ResumeSandbox. The resume-build gdb debugging flow needs it: the guest is held at a gdb entry breakpoint and never boots envd, so the readiness wait would time out and tear the sandbox down before a debugger can attach.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Reusable gdb macros (promoted from the restore-path PoC) loaded by
resume-build -gdb: fc-faults [N] attributes guest page faults to comm/pid/VMA,
plus fc-task, fc-curr, fc-regions, fc-va. Targets Linux 6.1.x x86_64.
Sample output (base template resumed under gdb on a dev node):
fc-faults 3:
FAULT comm=systemd-journal pid=283 addr=0x7f833ca73ea8 vma=0x7f833ca27000-0x7f833cb27000 flags=0xfb
FAULT comm=systemd-journal pid=283 addr=0x7f833e0c7088 vma=0x7f833e0c7000-0x7f833e0c8000 flags=0xfb
FAULT comm=systemd-journal pid=283 addr=0x7f833ca56488 vma=0x7f833ca27000-0x7f833cb27000 flags=0xfb
fc-curr 0 / fc-curr 1 (per-vCPU current task):
task=0xffffffff82419600 comm=swapper/0 pid=0 tgid=0 mm=0x0
task=0xffff888003dad800 comm=systemd-journal pid=283 tgid=283 mm=0xffff888002c62a80
fc-regions / fc-va 0x100000:
page_offset_base = 0xffff888000000000
vmemmap_base = 0xffffea0000000000
vmalloc_base = 0xffffc90000000000
__va(0x100000) = 0xffff888000100000
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
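The `fc-va` line in the sample output is plain direct-map arithmetic: on x86_64 Linux, `__va(pa)` is the physical address plus `page_offset_base`. A small worked check, using the value printed by `fc-regions` above (the helper name `physToVirt` is made up for this sketch and is not part of the macros):

```go
package main

import "fmt"

// pageOffsetBase is the value printed by fc-regions in the sample output.
const pageOffsetBase uint64 = 0xffff888000000000

// physToVirt mirrors the kernel's __va() for the direct map:
// virtual = physical + page_offset_base.
func physToVirt(pa uint64) uint64 {
	return pageOffsetBase + pa
}

func main() {
	fmt.Printf("__va(0x100000) = %#x\n", physToVirt(0x100000))
	// prints "__va(0x100000) = 0xffff888000100000"
}
```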
PR Summary
Medium Risk
Overview
ResumeSandbox gains Config.SkipEnvdWait so the post-resume envd readiness wait is skipped when the guest never boots envd under gdb. Checked-in fc-debug.gdb macros and gdb-debugging.md document fault attribution and dev-only customer-data constraints. create-build reads PROXY_PORT (default 5007) so local builds can share a node with a running orchestrator proxy.
Reviewed by Cursor Bugbot for commit 89df3d2.
Code Review
Using time.After inside a select loop in waitForSocket creates a new timer on every iteration, which can cause a significant resource leak of timers; replacing it with a reused time.Ticker avoids this issue. Additionally, the write operation to the temporary GDB init script file in writeInitScript does not check for errors, which can lead to silent failures or truncated scripts if the disk is full or write permissions are restricted.
func waitForSocket(ctx context.Context, path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if info, err := os.Stat(path); err == nil && info.Mode()&os.ModeSocket != 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timeout")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(50 * time.Millisecond):
		}
	}
}
Using time.After inside a select loop creates a new timer on every iteration, which can cause a significant resource leak of timers if GDB takes a while to connect or if the context is canceled. Replacing this with a reused time.Ticker avoids leaking timers and ensures efficient resource usage.
Suggested change:
func waitForSocket(ctx context.Context, path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		if info, err := os.Stat(path); err == nil && info.Mode()&os.ModeSocket != 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timeout")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
Good catch — switched waitForSocket to a single reused time.Ticker (created once, defer ticker.Stop()) instead of allocating via time.After each iteration. Fixed in 2e27ece.
func writeInitScript(symbols, socket string) (string, error) {
	macroLib, err := macroLibPath()
	if err != nil {
		return "", err
	}
	f, err := os.CreateTemp("", "fc-debug-init-*.gdb")
	if err != nil {
		return "", err
	}
	defer f.Close()
	fmt.Fprintf(f, `set pagination off
set confirm off
source %s
# FC boots the uncompressed vmlinux ELF directly, so KASLR never relocates the image:
# symbols sit at their link-time addresses (offset 0).
add-symbol-file %s -o 0x0
target remote %s
`, macroLib, symbols, socket)

	return f.Name(), nil
}
The write operation to the temporary GDB init script file does not check for errors, which can lead to silent failures or truncated scripts if the disk is full or write permissions are restricted. Explicitly checking the write error and closing the file descriptor before returning ensures robust error handling.
func writeInitScript(symbols, socket string) (string, error) {
	macroLib, err := macroLibPath()
	if err != nil {
		return "", err
	}
	f, err := os.CreateTemp("", "fc-debug-init-*.gdb")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "set pagination off\nset confirm off\nsource %s\n# FC boots the uncompressed vmlinux ELF directly, so KASLR never relocates the image:\n# symbols sit at their link-time addresses (offset 0).\nadd-symbol-file %s -o 0x0\ntarget remote %s\n", macroLib, symbols, socket)
	if err != nil {
		return "", err
	}
	if err := f.Close(); err != nil {
		return "", err
	}
	return f.Name(), nil
}
Fixed in 2e27ece — writeInitScript now checks the Fprintf write error and the explicit Close error, and removes the temp file on the error path (so a failed write does not leave a truncated init script behind).
c2dedd5 to 8ab8d03
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8ab8d03.
Resume a snapshot under a gdb-enabled Firecracker held at the kernel entry breakpoint and hand over a ready, symbol-resolving gdb session.

Sets FIRECRACKER_GDB_SOCKET (inherited by the no-jailer FC), skips the envd wait, stages the debug FC binary at the resolved path (restored on exit), resumes in the background and connects gdb once FC binds the socket (the stub holds the snapshot load open until a debugger attaches, so resuming first would deadlock), generates a parameterized init script sourcing fc-debug.gdb, prints a debug-context block, and runs gdb interactively or in batch (-gdb-exec/-gdb-script). One session per invocation; FC/UFFD/NBD torn down on exit.

The debug artifacts (firecracker-debug, vmlinux.debug) are fetched by version from the release buckets the same way create-build fetches the prod kernel/FC (base overridable via E2B_GDB_ARTIFACTS_URL); -gdb-fc / -gdb-symbols override a fetch with a local build.

Symbols load at their link-time addresses (offset 0): Firecracker boots the uncompressed vmlinux ELF directly, so the bzImage KASLR decompressor never runs and the kernel image is never relocated, leaving no slide to recover. This holds even with CONFIG_RANDOMIZE_BASE / CONFIG_RANDOMIZE_MEMORY enabled, since both are gated on a boot-params flag that only the decompressor sets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
How to debug a sandbox's guest kernel with resume-build -gdb: prerequisites and artifact staging, steps (interactive + scripted), the macro reference, the observer-effect guidance (HW breakpoints, resident set from the UFFD log), and the binding customer-data rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
create-build hardcoded the sandbox proxy port to 5007, so it could not run on a node that already has a live orchestrator bound there. Read PROXY_PORT (default 5007) so create-build can run alongside a live orchestrator on an alternate port, as the gdb e2e does.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
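A minimal sketch of reading PROXY_PORT with a 5007 default. The helper below is hypothetical; in particular, falling back to the default on an unparsable or out-of-range value is an assumption, since the commit message only states the default.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultProxyPort is the previously hardcoded port named in the commit message.
const defaultProxyPort = 5007

// proxyPort returns the port from PROXY_PORT, or defaultProxyPort when the
// variable is unset, not a number, zero, or outside the 16-bit port range.
// getenv is injected so the function can be exercised without touching the
// process environment.
func proxyPort(getenv func(string) string) uint16 {
	raw := getenv("PROXY_PORT")
	if raw == "" {
		return defaultProxyPort
	}
	port, err := strconv.ParseUint(raw, 10, 16)
	if err != nil || port == 0 {
		return defaultProxyPort
	}
	return uint16(port)
}

func main() {
	fmt.Println(proxyPort(os.Getenv))
}
```

Running this with `PROXY_PORT=6007` set selects the alternate port, and running it with the variable unset keeps 5007.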
8ab8d03 to 89df3d2

Why
Host- and UFFD-side telemetry can't show what happens inside a resumed guest: which process/VMA is faulting, kernel state during the resume. This adds a way to attach `gdb` to a resumed sandbox guest with source-level kernel symbols, for diagnosing resume behaviour (page-fault attribution, scheduler/VM state) on a dev node.

What
`resume-build -gdb` resumes a snapshot under a `--features gdb` Firecracker held at the kernel entry breakpoint, loads the guest kernel's DWARF symbols, and hands over a ready `gdb` session (interactive, or scripted via `-gdb-exec`/`-gdb-script`). Commit-by-commit:
- `feat(orch): allow skipping the envd readiness wait on resume`: `Config.SkipEnvdWait` gates the post-resume `WaitForEnvd` in `ResumeSandbox` (the guest is held at the breakpoint and never boots envd).
- `feat(orch): add fc-debug.gdb guest-kernel debugging macros`: reusable gdb macros: `fc-faults [N]` (attributes guest page faults to `comm`/`pid`/VMA), `fc-curr`, `fc-task`, `fc-regions`, `fc-va`. Targets Linux 6.1.x x86_64.
- `feat(orch): add resume-build -gdb …`: the orchestration: arms `FIRECRACKER_GDB_SOCKET`, stages the debug FC binary, resumes in the background and connects gdb once FC binds the socket (the stub holds the snapshot load open until a debugger attaches, so resuming first would deadlock), generates the init script, prints a debug-context block, drives gdb. Debug artifacts (`firecracker-debug`, `vmlinux.debug`) are fetched by version from `e2b-prod-public-builds` (override base via `E2B_GDB_ARTIFACTS_URL`; override paths via `-gdb-fc`/`-gdb-symbols`). Symbols load at offset 0: FC boots the uncompressed `vmlinux` ELF directly, so image KASLR never runs and there's no slide to recover.
- `docs(orch): add … runbook`: `gdb-debugging.md` (usage, macros, observer-effect notes, customer-data rules).
- `feat(orch): honor PROXY_PORT in create-build`: lets `create-build` run alongside a live orchestrator on an alternate port (needed to build the test snapshot on a node).

Validation
- `--features gdb` FC on the cluster, `create-build` cold-booted the DWARF kernel into a snapshot, then `resume-build -gdb` (no flags) fetched both artifacts by version, recovered a ready session, and produced real fault attribution:
  FAULT comm=systemd-journal pid=283 addr=0x7f83… vma=0x7f83…-… flags=0xfb
  `fc-curr` → comm=systemd-journal pid=283 …; `fc-regions` → page_offset_base=0xffff888000000000 …
- `debugArtifactsBaseURL`, `resolveOrFetch` (override→local→fetch precedence, 404), `proxyPort` parsing/fallback.
- `golangci-lint` clean; every commit builds and passes tests on its own.

🤖 Generated with Claude Code