feat(data-pipeline)!: CSS Trace Filters#1985
Conversation
📚 Documentation Check Results📦
|
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
🔒 Cargo Deny Results📦
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1985 +/- ##
==========================================
+ Coverage 72.75% 72.94% +0.18%
==========================================
Files 458 460 +2
Lines 75827 76737 +910
==========================================
+ Hits 55169 55974 +805
- Misses 20658 20763 +105
🚀 New features to boost your workflow:
|
|
✨ Fix all issues with BitsAI or with Cursor
|
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
test: sort stats by resource for deterministic snapshot
…s-trace-filtering
…s-trace-filtering
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4e0ff40bf4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // FIXME: when client_computed_top_level is true, looking twice for the root span here is | ||
| // inefficient and just below in process_traces_for_stats. | ||
| // Also, only do it when css is on | ||
| self.trace_filterer.filter_traces(&mut traces); |
There was a problem hiding this comment.
Preserve filtered traces for agent export
When the agent advertises filter_tags or ignore_resources, this mutates the only traces vector before it is passed to stats processing and then serialized/sent to the agent below. These filters are described in this change as client-side-stats filters, so a trace that should be excluded only from stats (for example matching ignore_resources) is now removed from the payload entirely and never reaches the agent/OTLP endpoint, causing trace data loss whenever CSS filters are configured. Apply the filter only to a stats-only copy/path, not the export path.
Useful? React with 👍 / 👎.
| fn check_agent_info(&self) { | ||
| if let Some(agent_info) = agent_info::get_agent_info() { | ||
| if self.has_agent_info_state_changed(&agent_info) { | ||
| // FIXME: trace_filterer should only be enabled when CSS is on. (why ?) |
There was a problem hiding this comment.
To my understanding this feature is a pre-requisite for filtering of stats to work, but it's a standalone feature, per se?
There was a problem hiding this comment.
I agree but it increases the risk in case the implementation is wrong: if we drop more trace than what the agent would have dropped it would affect everyone but if we apply filters only with CSS this only affects a small subset of users.
There was a problem hiding this comment.
I agree with @bwoebi.
Complex features (like the trace exporter, V1, CSS, etc) should be gradually rolled out. Does this rise to the level of complexity where we need a gradual rollout (I'm not confident it does)? If so, the correct way to do that is let the SDKs handle it. We should provide an option in the builder so that SDKs can opt-in, and coordinate with LP to figure out how they want to approach the rollout.
risk in case the implementation is wrong
We should be confident in our implementations. Doesn't this functionality already exist in several tracers? Are there existing system tests? Do we have sufficient unit and integration tests in libdatadog? The point of gradual rollouts is to limit the blast radius of unknown unknowns when running on customer environments. This PR doesn't really introduce any new and complicated paradigms. It's behavior should be deterministic and reproducible.
There was a problem hiding this comment.
I've already written a PR for adding system-test for this feature: DataDog/system-tests#6952. It passes locally with dd-trace-py using this branch.
I will change it to only apply trace filters when CSS is enabled to reduce the impact.
There was a problem hiding this comment.
any particular reason why the filename is using a combo of single and double underscores?
There was a problem hiding this comment.
This is generated by insta snapshot tests. The file name corresponds to the absolute path to the test function with :: replaced with double underscores. It corresponds to this test: https://github.com/DataDog/libdatadog/pull/1985/changes#diff-4495bd699a7cc3c377b8a943aae2c2ab9e48fa249f1e8016c5f4d45886982604R2178
I added this kind of test for end-to-end testing of the TraceExporter with trace filters without using the test-agent so we can have more control for the response we're getting in /info. It might be overkill for this simple feature but I think it could come in handy for these kind of end-to-end tests.
|
It looks like this PR is still in progress and should probably be moved to draft. |
| if let Some(agent_info) = agent_info::get_agent_info() { | ||
| if self.has_agent_info_state_changed(&agent_info) { | ||
| // FIXME: trace_filterer should only be enabled when CSS is on. (why ?) | ||
| self.trace_filterer.update_conf( |
| // TODO: doc | ||
| pub fn set_filter_tags(&mut self, filter_tags: FilterTagsConfig) -> &mut Self { | ||
| self.filter_tags = filter_tags; | ||
| self | ||
| } | ||
|
|
||
| // TODO: doc | ||
| pub fn set_filter_tags_regex(&mut self, filter_tags_regex: FilterTagsConfig) -> &mut Self { | ||
| self.filter_tags_regex = filter_tags_regex; | ||
| self | ||
| } | ||
|
|
||
| // TODO: doc | ||
| pub fn set_ignore_resources(&mut self, ignore_resources: Vec<String>) -> &mut Self { | ||
| self.ignore_resources = ignore_resources; | ||
| self | ||
| } | ||
|
|
There was a problem hiding this comment.
I don't think we want to configure trace filters programmatically. They're only passed by the agent /info.
There was a problem hiding this comment.
I left a related comment elsewhere.
I agree, we should probably only rely on /info.
| } | ||
|
|
||
| // FIXME: duplicated with super::get_root_span_index | ||
| pub fn get_root_span_index_v4<T>(trace: &[Span<T>]) -> anyhow::Result<usize> |
There was a problem hiding this comment.
I don't think this is specific to v4
| } | ||
| } | ||
|
|
||
| // FIXME: duplicated with super::get_root_span_index |
There was a problem hiding this comment.
Yes this not an ideal solution at the time we we're hoping to be able to migrate serverless from pb::Span to Span but protobuf serialization support has been stopping us from doing so. We can improve it in the future but it's fine to duplicate it for now.
| }; | ||
|
|
||
| let value = match value { | ||
| Some(value) => match compile_anchored(value) { |
There was a problem hiding this comment.
Why do we have to force it to be anchored ?
There was a problem hiding this comment.
This is indeed not specified anywhere from what I've seen.
This behavior is different between the two implementation I've been using as reference:
dd-trace-dotnetand the agent do not anchordd-trace-phpdoes anchor
If I had to choose I would say that it's more intuitive for the user for the patterns to automatically be anchored but in any-case this needs to be added to the spec and in the public documentation.
There was a problem hiding this comment.
The trace filters is an agent feature that we are porting to the tracer to make CSS compatible with this feature. Therefore the only implementation reference is the agent and our implementation should be bug for bug compatible with the agent.
There was a problem hiding this comment.
This is just the impression I got from the example in the spec:
"filter_tags_regex": {
"reject": ["version:.*-beta", "experimental_.*"]
},
If we were not to anchor, the leading or trailing .* would be irrelevant.
There was a problem hiding this comment.
Also, it would make it surprising that suffixes are not filtered. e.g. db[0-9]+ would match db1, but also db1-staging
| /// O(1) lookup. `.` is intentionally treated as a literal (not a wildcard) in key patterns. | ||
| fn is_literal_key(key: &str) -> bool { | ||
| !key.contains([ | ||
| '*', '+', '?', '[', ']', '(', ')', '{', '}', '^', '$', ',', '\\', |
There was a problem hiding this comment.
Should this include "|" ?
There was a problem hiding this comment.
Yes. Needs to be fixed in dd-trace-php too
| } | ||
|
|
||
| /// Returns `true` when `key` contains no regex metacharacters and can be used for a direct | ||
| /// O(1) lookup. `.` is intentionally treated as a literal (not a wildcard) in key patterns. |
There was a problem hiding this comment.
Regarding "." I don't see it in the spec, is it from the agent ? Also how is this enforced when matching an actual regex in the key
There was a problem hiding this comment.
Also not specified and not enforced when matching either. This is copied from dd-trace-php implementation but:
dd-trace-dotnetdoesn't do this separation, it always treats regex filters as key regex and value regex basically (which is less efficient)- the agent doesn't support regex key
I think treating "." as a literal does make sense given how common it is in tag keys but it might make it less intuitive because right now it's a literal only if there is no other regex metacharacters in which case it's actually a wildcard.
It needs to be added to the spec anyway but I see 2 options here:
- Never treat it as a literal
- Enforce it being a literal by escaping it in the final pattern if there is other metacharacters
There was a problem hiding this comment.
the agent doesn't support regex key
Then we shouldn't support regex key and update the CSS spec to be aligned with the agent
There was a problem hiding this comment.
Yes, that's an optimization, you basically always use .* or something similar.
Otherwise any key which includes a dot will need to do a full-scan match against every single tag.
I found that for any practical purpose, this is a fine optimization to do.
Not doing this optimization clearly shows negatively up in performance terms for trivial value-only regexes otherwise.
But indeed, not supporting it at all is probably better :-D
| let conf = self.conf.load(); | ||
| traces.retain(|trace| { | ||
| let Ok(root_span_index) = get_root_span_index_v4(trace) else { | ||
| // FIXME: in this case it's a distributed trace ? Maybe we should remove the debug |
There was a problem hiding this comment.
In the agent is this applied to based on root span only, or on local root span ?
There was a problem hiding this comment.
Honestly not sure. Seems like traces are filtered chunk by chunk in the agent, it doesn't seem to wait for the whole distributed trace to arrive.
Needs to be double checked.
| return true; | ||
| } | ||
|
|
||
| if !conf.ignore_resources.is_empty() { |
|
Starting to think another RFC is needed for the CSS spec because implementation differs quite a bit between existing implementations already: dd-trace-php, dd-trace-dotnet and the agent But trace filter is not really linked to CSS: it's a requirement but it can be enabled separately so it could maybe live in another spec ? |
What does this PR do?
Implements the CSS trace-level filtering mechanism, that is applied before stats computation.
Motivation
It's un unmet requirement from the spec.
Additional Notes
Please Ctrl+F
FIXMEwhen reviewing, I let some questions in the code that I'd like answers for before mergingTODO:
Wont do
advanced resource normalization (sql, redis etc...).No other implementation does that100% normalization edge cases (empty service field + non-normalized name + filter depending on that)Too specific edge case. Ignoring it for nowAPMSP-2763