RTL8814AU: drop REG_CR=0 post-fwdl write that wedges bulk-OUT #49

Merged
josephnef merged 1 commit into master from fix/issue-36-reg-cr-zero-wedge on May 26, 2026
@josephnef
Collaborator

Summary

  • FirmwareDownload_8814A was writing REG_CR (0x0100) = 0 immediately after MCUFWDL=0x79. This clears all 8 enable bits in byte 0 — including the DMA-enable bits (0..5).
  • The later REG_CR |= MACTXEN | MACRXEN at HalModule.cpp:241 is a 2-bit OR: it sets bits 6+7 but leaves bits 0..5 at zero. As a result the chip's TX/RX DMA engines never come up, and bulk-OUT URBs queue at EP 0x02 while the FIFO has no drain path. The URBs sit at the chip until libusb's 500 ms async timeout cancels them (-ENOENT), giving the catastrophic submit-failure pattern reported in #36 (RTL8814AU: devourer TX degrades to LIBUSB_ERROR_IO after USB passthrough cycles).
  • Kernel rtw88_8814au never writes REG_CR=0 during post-fwdl. The "byte-for-byte rtw88-mirror" comment block above this code is wrong on this specific address.
  • Bisected by gating each of the 7 divergent post-fwdl writes (0x010d, 0x0100, 0x1330, 0x0230, 0x022c, REG_BCN_CTRL, 0x0210) behind env vars; only 0x0100 reproduces the wedge.
  • See #36 comment with the full bisect ladder + per-write data.
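
The bisect described in the last two bullets can be sketched as follows. This is a minimal illustration, not the code used in the PR: the `SKIP_POST_FWDL_*` variable names and both helper functions are hypothetical, and `REG_BCN_CTRL` is kept symbolic because its address is not given here.

```python
import os

# The seven divergent post-fwdl writes named in the summary above.
DIVERGENT_WRITES = [
    "0x010d", "0x0100", "0x1330", "0x0230", "0x022c", "REG_BCN_CTRL", "0x0210",
]


def gate_name(target: str) -> str:
    """Hypothetical env-var name that suppresses one write."""
    return "SKIP_POST_FWDL_" + target.upper().replace("0X", "")


def enabled_writes(env=None) -> list[str]:
    """Return the writes that would still be issued under the given environment."""
    env = os.environ if env is None else env
    return [t for t in DIVERGENT_WRITES if env.get(gate_name(t)) != "1"]


# One bisect step per write: suppress exactly one target, then rerun the TX demo.
for target in DIVERGENT_WRITES:
    env = {gate_name(target): "1"}
    print(f"{gate_name(target)}=1 -> still applied: {enabled_writes(env)}")
```

Gating one write at a time keeps every run a single-variable experiment, which is what lets the summary attribute the wedge to 0x0100 alone.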

Scope

  • This resolves #36, "RTL8814AU: devourer TX degrades to LIBUSB_ERROR_IO after USB passthrough cycles" (catastrophic LIBUSB_TRANSFER_TIMED_OUT submit failures on devourer-TX 8814AU after USB cycling).
  • It does not restore 8814AU on-air emission — URBs complete cleanly now but no frames hit the air. That is a separate gate (likely TX-descriptor / rate-config) and out of scope here; will track separately.
  • RTL8814AU devourer-RX in matrix is also still broken (cells 11/12/19/20/23/24 = 0 hits) — pre-existing, unrelated.

Test plan

  • Local WiFiDriverTxDemo 12 s on 0bda:8813: 2203/2203 OK, 0 fail (was 815 submits / 575 fail = 0.4% completion on master).
  • RTL8812AU WiFiDriverTxDemo sanity: 796/796/0 unchanged (different code path).
  • RTL8821AU WiFiDriverTxDemo sanity: 991/991/0 unchanged (different code path).
  • sudo python3 tests/regress.py --full-matrix --channel 100 --vm-name devourer-testrig --vm-ssh josephnef@... (the original #36 repro): 8814 devourer-TX cells [2,4,6,8] now show 0 hits / 4500 TX with no (N fail) annotation, which indicates tx_failures == 0 per regress.py:494-495. Before the fix, each cell showed (4700+ fail). 8812/8821 devourer-TX cells are unchanged (5927–6884 hits, identical to pre-fix).
  • CI matrix builds (GCC/Clang/MSVC on Ubuntu/macOS/Windows): should be unaffected, since this is a single-line removal in an 8814-only code path.

🤖 Generated with Claude Code

FirmwareDownload_8814A's post-fwdl CPU kick zeroes REG_CR (0x0100) just
after MCUFWDL=0x79. This clears all 8 enable bits in byte 0 (HCI TX/RX
DMA, TXDMA, RXDMA, PROTOCOL, SCHEDULE, MACTXEN, MACRXEN). The later
`REG_CR |= MACTXEN|MACRXEN` at HalModule.cpp:241 only re-sets bits 6+7,
leaving the DMA-enable bits 0..5 at zero — so the chip's TX/RX DMA
engines never come up. bulk-OUT URBs queue at EP 0x02 but the FIFO
never drains; URBs sit until libusb's 500 ms async timeout cancels
them (-ENOENT), producing the catastrophic submit-failure pattern
reported in #36.
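
The register arithmetic in the paragraph above can be modelled in a few lines. This is a sketch under stated assumptions: the bit order follows the list in the paragraph, the constant names are illustrative rather than copied from a header, and the starting value of 0xFF (all eight enables set before the CPU kick) is assumed only to make the effect visible.

```python
# Byte 0 of REG_CR (0x0100), bits 0..7 in the order listed above.
HCI_TXDMA_EN = 1 << 0
HCI_RXDMA_EN = 1 << 1
TXDMA_EN = 1 << 2
RXDMA_EN = 1 << 3
PROTOCOL_EN = 1 << 4
SCHEDULE_EN = 1 << 5
MACTXEN = 1 << 6
MACRXEN = 1 << 7

DMA_PATH_BITS = 0x3F  # bits 0..5


def post_fwdl_reg_cr(initial: int, zero_write: bool) -> int:
    """Model byte 0 of REG_CR after the post-fwdl sequence."""
    reg = initial & 0xFF
    if zero_write:
        reg = 0x00  # the REG_CR=0 write that this change removes
    reg |= MACTXEN | MACRXEN  # the later 2-bit OR
    return reg


before_fix = post_fwdl_reg_cr(0xFF, zero_write=True)
after_fix = post_fwdl_reg_cr(0xFF, zero_write=False)
print(f"with REG_CR=0 write:    0x{before_fix:02X}")
print(f"without REG_CR=0 write: 0x{after_fix:02X}")
```

With the zero write in place, only bits 6 and 7 survive, so `before_fix & DMA_PATH_BITS` is 0. Without it, whatever the earlier init left in bits 0..5 is preserved.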

Kernel rtw88_8814au never writes REG_CR=0 during post-fwdl. The
"byte-for-byte rtw88-mirror" comment block above this code was wrong
about this specific address.

Bisected today by gating the 7 divergent post-fwdl writes individually
behind env vars; only 0x0100 reproduces the wedge.

Verification:
- Local devourer-TX 12 s on 8814AU: 2203/2203 OK (was 0.4% completion)
- 8812AU + 8821AU sanity: unchanged (different code path)
- tests/regress.py --full-matrix: 8814 devourer-TX cells [2,4,6,8]
  now show 0 fail annotation (was 4700+ failures each)

The fix is sufficient for #36 but does not restore 8814AU on-air
emission — chips ACK URBs cleanly but no frames hit air. That is a
separate gate (TX descriptor or rate config) and out of scope here.

Closes #36.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@josephnef josephnef merged commit 5b43870 into master May 26, 2026
5 checks passed
@josephnef josephnef deleted the fix/issue-36-reg-cr-zero-wedge branch May 26, 2026 14:28
josephnef added a commit that referenced this pull request May 26, 2026
In RtlJaguarDevice::send_packet the SET_TX_DESC_*_8812 macros are
bit-identical to the SET_TX_DESC_*_8814A macros (verified against
hal/rtl8814a_xmit.h), so devourer can keep using the 8812 macro set
on 8814A. But a usbmon byte-diff against a working VM-passthrough
88XXau monitor-injection session (qemu USB-host-passthrough → VM
kernel 88XXau → bulk-OUT URBs back through host xhci) shows three
field-value mismatches on 8814A:

  Dword 0 bit 31 — 8812 calls it OWN, 8814A calls it DISQSELSEQ.
    88XXau leaves bit 31 = 0 for monitor-injected frames; devourer's
    SET_TX_DESC_OWN_8812(..., 1) sets it to 1, which on 8814A means
    DISQSELSEQ=1 (disable queue-select-based sequence numbering).
  Dword 2 bits 24-29 (GID) — 88XXau leaves at 0 for injection;
    devourer writes 0x3F.
  Dword 4 bits 18-23 (DATA_RETRY_LIMIT) — 88XXau leaves at 0 for
    injection; devourer writes 12 (RETRY_LIMIT_ENABLE stays 1 in both).

Skip those writes on 8814A so the emitted descriptor byte-matches
aircrack-ng's reference monitor-injection format. Add a
DEVOURER_TX_LEGACY_8812_DESC=1 env-gate to restore the old behaviour
without rebuilding, in case anything downstream depends on it.
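
The three field writes and the env-gate described above can be sketched in Python. This is an illustration only, not the code in RtlJaguarDevice::send_packet: the helper names are hypothetical, and the 40-byte buffer and little-endian dword layout are assumptions made so the example is self-contained.

```python
import os
import struct


def set_bits(desc: bytearray, dword: int, shift: int, width: int, value: int) -> None:
    """Write value into a bit field of one little-endian 32-bit descriptor word."""
    offset = dword * 4
    (word,) = struct.unpack_from("<I", desc, offset)
    mask = ((1 << width) - 1) << shift
    word = (word & ~mask) | ((value << shift) & mask)
    struct.pack_into("<I", desc, offset, word)


def fill_divergent_fields(desc: bytearray, is_8814a: bool, legacy: bool = False) -> None:
    """Apply the three writes, skipping them on 8814A unless legacy is set."""
    if is_8814a and not legacy:
        return  # leave bit 31, GID and DATA_RETRY_LIMIT at 0
    set_bits(desc, 0, 31, 1, 1)     # OWN on 8812, DISQSELSEQ on 8814A
    set_bits(desc, 2, 24, 6, 0x3F)  # GID
    set_bits(desc, 4, 18, 6, 12)    # DATA_RETRY_LIMIT


legacy = os.environ.get("DEVOURER_TX_LEGACY_8812_DESC") == "1"
desc = bytearray(40)  # assumed descriptor size; only dwords 0, 2 and 4 are touched
fill_divergent_fields(desc, is_8814a=True, legacy=legacy)
print(desc.hex())
```

Passing `is_8814a=False` applies all three writes, which matches the statement below that the 8812AU and 8821AU paths are unchanged.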

This does NOT resolve #50 (8814AU on-air silence has a separate root
cause that vendor-control-write replay cannot reach — both sessions on
2026-05-26 ruled out 9 distinct hypotheses including a binary
URB-flag diff, see comment-4546974748). The change is purely about
descriptor correctness — aligning devourer's TX descriptor format
with the byte-level reference that the working kernel driver produces.

8812AU and 8821AU paths are bit-for-bit identical to current master
(is_8814a is false there and all writes fire as before). Smoke-tested
on the live bench:

  8812AU: 760 submits / 760 complete / 0 fail
  8814AU (new): 3572 submits / 3572 complete / 0 fail (vs current
                master's behaviour, which is identical at libusb level
                because devourer's descriptor differences from 88XXau
                are no-ops at the bulk-OUT path post-PR-#49)
  8814AU (DEVOURER_TX_LEGACY_8812_DESC=1): same as without env

Refs #50 (partial — descriptor alignment only, not the on-air gate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
josephnef added a commit that referenced this pull request May 30, 2026
…#58)

## Summary

`tests/regress.py --channel` defaulted to `36` (5GHz UNII-1), and every
matrix invocation in `README.md` + `CLAUDE.md` examples used `--channel
100` (5GHz UNII-2-extended). This hid a long-standing fact: **devourer's
5GHz code path has broken cells for 8814 RX, 8821 TX, and 8821 RX that
all pass at 2.4GHz**. The "RTL8814AU... RX solid" line in `CLAUDE.md`
was correct AT 2.4GHz but appeared to contradict matrix output captured
at 5GHz — which is why PR bodies #34, #42, and #49 all record "8814 RX
devourer still broken" but those cells work fine at ch6.

## What this changes

- `tests/regress.py` — default `--channel` → `6`. Help text spells out
that 5GHz has known broken cells.
- `tests/README.md` — example invocations drop the explicit `--channel
100`. Added a "Channel / band asymmetry" entry to Known Limitations.
- `CLAUDE.md` — regress.py examples drop `--channel 100`. Adds a
paragraph explaining the band asymmetry.
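
As a rough illustration of the first item, a default flip of this kind might look like the following with `argparse`. This is a sketch, not the contents of `tests/regress.py`: the `build_parser` helper and the help wording are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing only the --channel option."""
    parser = argparse.ArgumentParser(prog="regress.py")
    parser.add_argument(
        "--channel",
        type=int,
        default=6,
        help="Wi-Fi channel for the run (default: %(default)s, 2.4GHz). "
             "5GHz channels such as 100 have known broken devourer cells.",
    )
    return parser


args = build_parser().parse_args([])
print(args.channel)
```

Passing `--channel 100` explicitly still selects the 5GHz band, so only the default changes.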

## What this does NOT change

- The actual 5GHz code-path issues — separate investigation (follow-up
PR will tackle 8814 RX at 5G, 8821 TX/RX at 5G).
- The persistent 8814AU TX gate — 0 hits at both bands; unchanged.
- The 8812AU code paths, which work at both bands.

## Empirical evidence: single-pair matrix at both bands, master `9e5287e` post-PR-57

VM mode (`devourer-testrig` + `aircrack-ng/88XXau`), 12s per cell,
`--no-baseline-abort`.

### TX=8812, RX=8814

| cell | ch100 | ch6 |
|---|---|---|
| kernel TX → kernel RX (baseline)   | 292 ✓ | 339 ✓ |
| devourer TX → kernel RX            | 4839 ✓ | 5279 ✓ |
| **kernel TX → devourer RX**        | **0 ✗** | **300 ✓** |
| **devourer TX → devourer RX**      | **0 ✗** | **5500 ✓** |

### TX=8821, RX=8812

| cell | ch100 | ch6 |
|---|---|---|
| kernel TX → kernel RX (baseline)   | 108 ✓ | 336 ✓ |
| **devourer TX → kernel RX**        | **0 ✗** | **5544 ✓** |
| kernel TX → devourer RX            | 100 ✓ | 300 ✓ |
| **devourer TX → devourer RX**      | **0 ✗ (105 fail)** | **5500 ✓** |

### TX=8812, RX=8821

| cell | ch100 (extrapolated from full-matrix) | ch6 |
|---|---|---|
| kernel TX → kernel RX (baseline)   | 348 ✓ | 345 ✓ |
| devourer TX → kernel RX            | 5517 ✓ | 5279 ✓ |
| **kernel TX → devourer RX**        | **0 ✗** | **300 ✓** |
| **devourer TX → devourer RX**      | **0 ✗** | **5200 ✓** |

### TX=8814, RX=anything (8814 TX gate — broken on both bands)

`0 hits` at both ch100 and ch6 for every cell where devourer TX is on
8814AU. Pre-existing gate, not addressed here. See kaeru cite `RTL8814AU
libusb-userspace bulk-OUT does not produce on-air TX`.

## Why ch6 as default

- The OpenIPC long-range-video use case typically runs at 2.4GHz.
- Out-of-the-box matrix runs should pass for the chips that work —
otherwise contributors get false-failure noise.
- The 5GHz issues are real but separate; the new help text + Known
Limitations entry tell users how to surface them deliberately.

## Test plan

- [x] `python3 -c 'import tests.regress'` clean import
- [x] `python3 tests/regress.py --help` renders the new help text
- [x] Single-pair matrix at `--channel 6` runs end-to-end and passes for
8812/8821 chip combos (table above)
- [x] Single-pair matrix at `--channel 100` reproduces the historical
5GHz broken cells (table above)
- [x] `--full-matrix --channel 100` matches prior PR bodies' tables
(confirms the change doesn't alter 5GHz behavior — it only flips the
default)

## Follow-up

Separate PR will investigate why devourer's 5GHz path is broken for 8814
RX / 8821 TX / 8821 RX. Probably a band-switch register sequence missing
somewhere in `RadioManagementModule::PHY_SwitchWirelessBand8812` or the
per-channel BB setup. Saved as kaeru cite `devourer 5GHz vs 2.4GHz cell
asymmetry — matrix --channel 100 default hides working 2.4G state` for
the next session.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Successfully merging this pull request may close these issues.

RTL8814AU: devourer TX degrades to LIBUSB_ERROR_IO after USB passthrough cycles
