[Cosmos] feat: Hedging Detection API (accessors on response/error types)#46901
Open
NaluTripician wants to merge 9 commits into
Open
[Cosmos] feat: Hedging Detection API (accessors on response/error types)#46901NaluTripician wants to merge 9 commits into
NaluTripician wants to merge 9 commits into
Conversation
…atch wiring Implements Phase 2 Priorities 1–3 of the cross-SDK hedging detection API work item (azure-sdk-for-python#46899). Adds three accessor methods — `is_hedging_started()`, `get_requested_regions()`, `get_responded_regions()` — to the five wrapper / exception types (`CosmosDict`, `CosmosList`, `CosmosItemPaged`, `CosmosAsyncItemPaged`, `CosmosHttpResponseError`, `CosmosBatchOperationError`, `CosmosClientTimeoutError`), backed by a shared private `_HedgingDetectionState`. New public types: * `azure.cosmos.RequestedRegion` (frozen slots dataclass) * `azure.cosmos.RequestedRegionReason` (non-exhaustive Enum with `_missing_` → `UNKNOWN` for forward compatibility — SE-016) Dispatch-site instrumentation: * INITIAL recorded inside the hedging handler arm body for index 0 (sync + async); INITIAL recorded at SynchronizedRequest entry for the non-hedged path. * HEDGING recorded inside the hedge-arm body AFTER the threshold delay and AFTER the cancellation check — pre-delay cancellation produces no phantom entry (AC10 / spec §12 ''no phantom entries''). * OPERATION_RETRY / REGION_FAILOVER recorded in `_retry_utility(_async).Execute` when the retry policy returns True; attached to exceptions before re-raise so error-path consumers see the full timeline. Closure-argument pattern (SE-002): the state flows through `execute_with_hedging` as an explicit `hedging_state` parameter (NOT on `request_params` — the deepcopy at line 96 of the handler would silently swallow child appends). Verified by AC8 regression test (to follow). Sync↔async parity (SE-004 / M11): every sync code site has a paired async modification. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds: * CHANGELOG entry under 4.16.0b3 (Features Added). * samples/hedging_detection.py + _async.py - customer-facing usage of the three new accessors on a CosmosDict response and a CosmosHttpResponseError. * tests/test_diagnostics_types.py - 9 unit tests covering AC6 (RequestedRegion frozen/equality/hash/pickle + RequestedRegionReason _missing_ -> UNKNOWN forward-compat per SE-016). * tests/test_hedging_detection.py - 21 unit tests covering AC1, AC3, AC4, AC5, AC8 (closure-arg deepcopy regression), AC10 (no phantom entries on cancelled hedge arm), AC11 (duplicate responses allowed), AC12 (reserved reasons never emitted in production code - repo-wide grep), plus headers invariant + threading.Lock concurrent-append stress test. * tests/test_hedging_detection_async.py - 23 async-twin tests including the sync<->async parity checker (AC9) that fails CI if any sync test gains a method without an _async counterpart. * tests/livecanary/ - opt-in (env-var gated) multi-region smoke tests covering AC13a (hedged read records HEDGING entry) and AC13b (all-regions-fail surfaces dispatch metadata on CosmosHttpResponseError). All 53 sync+async unit tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The parity-coverage test used cryptic short names `scls` and `acls` for `sync_cls` and `async_cls`. `scls` is flagged as a spelling error by cspell on the build-analyze CI job. Rename to the spelled-out form which is both more readable and cspell-clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address the 20 pylint findings flagged by the Build_Analyze CI job on the hedging-detection PR: * Remove no-op `try: ... except Exception: raise` blocks in `CrossRegionHedgingHandler.execute_single_request_with_delay` (sync + async). The bare reraise added nothing and triggered W0706 try-except-raise. The post-call `_record_response` logic still only runs on the success path, since an in-flight exception now propagates directly through the function. * Remove the unused `RequestedRegionReason` import from `aio/_retry_utility_async.py`. The async retry utility delegates retry-diagnostics recording to `_record_retry_diagnostics_and_attach` in the sync module, which owns the import. * Add full Sphinx-style docstrings (`:param`/`:type`/`:returns`/ `:rtype`) to every new hedging-detection internal flagged for missing doc elements: `CosmosItemPaged._copy_headers_stripped` (+ its async twin so they stay symmetric), `_HedgingDetectionState._record_request`, `_HedgingDetectionState._record_response`, `_HedgingDetectionAccessorsMixin.is_hedging_started`, `_HedgingDetectionAccessorsMixin.get_requested_regions`, `_HedgingDetectionAccessorsMixin.get_responded_regions`, `_attach_state_to_headers`, `_pop_state_from_headers`, `_record_retry_diagnostics_and_attach`, and `_resolve_region_name`. No public API surface change; only docstrings and the deletion of two inert try/except blocks. All 44 hedging-detection unit tests still pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds the cross-SDK Hedging Detection API (Option B) to azure-cosmos: three accessor methods (is_hedging_started, get_requested_regions, get_responded_regions) backed by a private _HedgingDetectionState, plus two new public value types (RequestedRegion, RequestedRegionReason). State is propagated as an explicit closure argument through the sync/async hedging handlers and retry utility, and stashed on response headers under a private sentinel key that wrappers pop in __init__.
Changes:
- New
_diagnostics.py/_diagnostics_types.pymodules with the state object, accessor mixin, and value types; exported fromazure.cosmos. - Threaded
hedging_state=through sync + async availability-strategy handlers,_Request, retry utilities, and the seven public wrapper/exception types. - Added sync/async unit tests, a parity checker, opt-in live multi-region canary tests, and two samples.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| azure/cosmos/_diagnostics_types.py | New public RequestedRegion dataclass + non-exhaustive RequestedRegionReason enum with _missing_ → UNKNOWN. |
| azure/cosmos/_diagnostics.py | _HedgingDetectionState, accessor mixin, and helpers to attach/pop state on headers. |
| azure/cosmos/init.py | Exports the two new public types. |
| azure/cosmos/_cosmos_responses.py | Adds mixin to CosmosDict/CosmosList/CosmosItemPaged/CosmosAsyncItemPaged, pops sentinel from headers. |
| azure/cosmos/exceptions.py | Adds mixin to CosmosHttpResponseError, CosmosBatchOperationError, CosmosClientTimeoutError. |
| azure/cosmos/_availability_strategy_handler.py | Records INITIAL/HEDGING dispatch + responding region in sync handler. |
| azure/cosmos/aio/_asynchronous_availability_strategy_handler.py | Same for async handler. |
| azure/cosmos/_synchronized_request.py | Creates per-op state, threads it through retry, attaches to response headers; adds _resolve_region_name. |
| azure/cosmos/aio/_asynchronous_request.py | Async parity for the above. |
| azure/cosmos/_retry_utility.py | New _record_retry_diagnostics_and_attach helper to record REGION_FAILOVER/OPERATION_RETRY and attach state to raised errors. |
| azure/cosmos/aio/_retry_utility_async.py | Async parity for retry recording. |
| CHANGELOG.md | Feature entry under 4.16.0b3 (Unreleased). |
| tests/test_diagnostics_types.py | Enum + frozen-dataclass tests. |
| tests/test_hedging_detection.py | Sync state + wrapper + handler tests, repo-wide grep for reserved reasons. |
| tests/test_hedging_detection_async.py | Async twin of the above plus sync↔async parity checker. |
| tests/livecanary/* | Opt-in live multi-region smoke tests (sync + async). |
| samples/hedging_detection*.py | New customer-facing samples. |
Comments suppressed due to low confidence (2)
sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_responses.py:85
HEDGING_STATE_HEADER_KEY = "__hedging_state__"is set as a normal string key on the headers dict._response_headers_listfor paged operations holds the actualCaseInsensitiveDictinstances mutated by the orchestrator; once_pop_state_from_headersruns on a wrapper-constructed page (e.g., for queries that produceCosmosDictitems in addition to the page-level headers), the sentinel disappears from the sameCaseInsensitiveDictreference. After that,_refresh_hedging_state_from_pages(which scans_response_headers_listlooking for the sentinel) will miss the state on subsequent calls. Verify whether the paged-item construction path can ever pop the sentinel beforeCosmosItemPageditself caches it, otherwise the accessors may regress to defaults after the first call.
def _refresh_hedging_state_from_pages(self) -> None:
"""Internal: scan pages from newest to oldest and update
``self._hedging_state`` with the most recent attached state, if any.
Does not mutate the underlying headers dicts."""
if not self._response_headers_list:
return
for h in reversed(self._response_headers_list):
from ._diagnostics import HEDGING_STATE_HEADER_KEY
state = h.get(HEDGING_STATE_HEADER_KEY) if h is not None else None
if state is not None:
self._hedging_state = state
return
sdk/cosmos/azure-cosmos/tests/test_hedging_detection_async.py:334
- The reserved-reasons grep test is duplicated verbatim between
test_hedging_detection.pyandtest_hedging_detection_async.py. Both walk every.pyunderazure/cosmos/on every test run, and the comment-stripping regex only strips trailing#…plus docstring boundary lines, missing the case where the reserved names appear inside multi-line strings or pylint-disable comments. Consolidate into a single helper to avoid maintenance drift and reduce duplicate I/O in CI.
def test_reserved_reasons_absent_from_production_code_async(self):
"""Same repo-wide guarantee as the sync test, executed via the async
suite so it runs in CI even when the sync test is skipped."""
import os
import re
here = os.path.dirname(os.path.abspath(__file__))
prod_root = os.path.abspath(os.path.join(here, "..", "azure", "cosmos"))
assert os.path.isdir(prod_root)
offenders = []
for dirpath, _dirs, files in os.walk(prod_root):
for fn in files:
if not fn.endswith(".py") or fn == "_diagnostics_types.py":
continue
full = os.path.join(dirpath, fn)
with open(full, "r", encoding="utf-8") as fh:
for ln, line in enumerate(fh, start=1):
stripped = line.split("#", 1)[0]
if re.match(r"\s*('''|\"\"\")", stripped):
continue
for offender in (
"RequestedRegionReason.TRANSPORT_RETRY",
"RequestedRegionReason.CIRCUIT_BREAKER_PROBE",
):
if offender in stripped:
offenders.append((full, ln, offender))
assert offenders == [], repr(offenders)
Addresses 10 review comments on the hedging-detection API PR: - C1: drop `slots=True` from RequestedRegion dataclass — the `slots=` argument was added in Python 3.10 but the package's `python_requires` is `>=3.9`. Frozen still guarantees the immutability contract; pickle round-trip preserved (verified via `test_pickleable`). - C2: narrow `pytest.raises((asyncio.CancelledError, BaseException))` to `pytest.raises(asyncio.CancelledError)`. The original tuple collapsed to `BaseException` and would have masked any other raise. - C3: replace `pytest.raises((AttributeError, Exception))` with `pytest.raises(dataclasses.FrozenInstanceError)` so the test pins the exact exception type the dataclass machinery raises. - C4: introduce `_import_sibling_test_module()` helper in the parity test that wraps `importlib.import_module` with a clear `AssertionError` if a sibling module cannot be located, instead of letting a bare `ModuleNotFoundError` look like a parity failure. - C5: wrap `CosmosClient(...)` in the sync livecanary `client` fixture with a context manager so connection-pool / transport resources are released even on test failure. - C6: flatten `_resolve_region_name` in `_synchronized_request.py` — single `try` with explicit returns; the trailing `return ''` outside the `try` and the nested `try/except` were redundant. - C7: in `_retry_utility._record_retry_request` swap the `type(retry_policy).__name__` string match for an `isinstance` check against the two already-imported policy modules (`_timeout_failover_retry_policy` / `_endpoint_discovery_retry_policy`). Refactor-safe and type-checker friendly. - C8/C10: hoist `HEDGING_STATE_HEADER_KEY` to the top-level import in `_cosmos_responses.py`; removes four inline imports inside `_refresh_hedging_state_from_pages` and `_copy_headers_stripped` (both sync and async). - C11: explicit `.. note::` block on `_pop_state_from_headers` documenting that it mutates the input headers dict in place (which is what keeps the private sentinel from leaking via `CosmosDict.get_response_headers()`). All 53 unit tests in the new files pass on Python 3.11 locally. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Python 3.12 Standard emulator job hit an unrelated InternalServerError from the Cosmos emulator on test_database_replace_throughput (a pre-existing test that does not touch the hedging-detection code paths). Same test passed on 3.10/3.11/3.13/3.14 Standard and on the 3.12 sdist run minutes earlier in the same job. Pushing an empty commit to rerun. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Python 3.13 DependencyChecks emulator job hit AzDO's 60-minute job timeout during azure-mgmt-cosmosdb dependency install (before pytest even started). All 5 Standard emulator jobs (3.10/3.11/3.12/3.13/3.14) and 4 DependencyChecks jobs (3.10/3.11/3.12) passed. Re-running. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cosmos: Hedging Detection API (Python, Option B)
Closes #46899
Summary
Adds the cross-SDK Hedging Detection API to
azure-cosmos(Python),shipping Option B from the public spec: three accessor methods —
is_hedging_started(),get_requested_regions(),get_responded_regions()— added directly to the four wrapper types and the three exception types
that customers receive from data-plane operations, backed by a shared
private
_HedgingDetectionState.Public surface (new)
RequestedRegion(frozen dataclass:region_name,reason)azure.cosmosRequestedRegionReason(enum, with_missing_ → UNKNOWN)azure.cosmosis_hedging_started() -> boolget_requested_regions() -> tuple[RequestedRegion, ...]get_responded_regions() -> tuple[str, ...]7 carrier types:
CosmosDict,CosmosList,CosmosItemPaged,CosmosAsyncItemPaged,CosmosHttpResponseError,CosmosBatchOperationError,CosmosClientTimeoutError.Design highlights
deep-copies
request_paramsat the worker boundary (line 96 of_availability_strategy_handler.py). Storing state onrequest_paramswould silently swallow child appends. State is therefore threaded through
the handler as an explicit
hedging_state=closure argument, into theretry utility via
kwargs["_hedging_state"], and back to the wrapperconstruction site by stashing on the response-headers dict under a
private sentinel key that each wrapper pops in its
__init__. Internalflag is also popped in
_Requestbefore reaching the HTTP pipeline.hedging-handler append happens AFTER the threshold sleep AND AFTER the
cancellation check, so a hedge arm that loses the race before its delay
elapses produces NO phantom entry.
RequestedRegionReason._missing_(cls, raw)returnscls.UNKNOWNfor anyunrecognized payload — calling
RequestedRegionReason("future_v42")never raises.
_HedgingDetectionStateuses athreading.Lockto guard the compound (append + flag set) update;snapshots are returned as immutable tuples.
Enum values are exported (forward compat) but a repo-wide grep test
asserts they are never emitted by production code.
Acceptance-criteria matrix
is_hedging_started()==Falsetest_record_initial_does_not_set_hedging_flagtest_cancelled_before_delay_records_nothing(sync + async)is_hedging_started()==True, HEDGING entry presenttest_record_hedging_sets_flag(sync + async)test_record_operation_retry_does_not_set_flagtest_cosmos_http_response_error_forwardsRequestedRegionfrozen / hashable / picklabletest_diagnostics_types.py(9 tests)copy.deepcopy(request_params)test_handler_dispatches_state_through_closure_not_params(sync + async)TestSyncAsyncParity.{test_parity_class_coverage,test_parity_method_coverage}test_cancelled_before_delay_records_nothing(sync + async)test_responded_regions_preserves_duplicatesTRANSPORT_RETRY/CIRCUIT_BREAKER_PROBEnever emitted in v1test_reserved_reasons_absent_from_production_code(repo-wide grep)tests/livecanary/test_hedging_detection_live[_async].py(opt-in via env vars)Test execution
Live canary suite is opt-in (skipped unless
COSMOS_MULTI_REGION_ENDPOINTand
COSMOS_MULTI_REGION_KEYenv vars are set).Files changed
Production (12)
azure/cosmos/_diagnostics_types.py(NEW)azure/cosmos/_diagnostics.py(NEW)azure/cosmos/__init__.pyazure/cosmos/_cosmos_responses.pyazure/cosmos/exceptions.pyazure/cosmos/_availability_strategy_handler.pyazure/cosmos/aio/_asynchronous_availability_strategy_handler.pyazure/cosmos/_synchronized_request.pyazure/cosmos/aio/_asynchronous_request.pyazure/cosmos/_retry_utility.pyazure/cosmos/aio/_retry_utility_async.pyCHANGELOG.mdTests / samples (7)
tests/test_diagnostics_types.py(NEW)tests/test_hedging_detection.py(NEW)tests/test_hedging_detection_async.py(NEW)tests/livecanary/__init__.py(NEW)tests/livecanary/test_hedging_detection_live.py(NEW)tests/livecanary/test_hedging_detection_live_async.py(NEW)samples/hedging_detection.py+samples/hedging_detection_async.py(NEW)Why no
.pyi?azure-cosmosdoesn't ship a__init__.pyi; onlypy.typedis present.Inline annotations on the new public types provide equivalent type-checker
coverage. Spec S2 was a SHOULD, not a MUST.
Notes for reviewers
PRs).
azure/cosmos/__init__.py.spec §7 forbids log-string scraping as a customer contract.