From 8e1f2716e10eab4f3c2dec08ab80832d8b891816 Mon Sep 17 00:00:00 2001
From: Jarek Potiuk
Date: Tue, 2 Jun 2026 20:17:21 +0200
Subject: [PATCH 1/6] Add draft threat model + SECURITY.md/AGENTS.md discoverability

Generated-by: Claude Code
---
 AGENTS.md       |  27 +++++++
 SECURITY.md     |  29 +++++++
 THREAT_MODEL.md | 205 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 261 insertions(+)
 create mode 100644 AGENTS.md
 create mode 100644 SECURITY.md
 create mode 100644 THREAT_MODEL.md

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 00000000000..1a28bf1d4e2
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,27 @@
+
+
+# Agent Guide for jena
+
+This file is read by automated agents (security scanners, code
+analyzers, AI assistants) operating on this repository.
+
+## Security
+
+Security model: [SECURITY.md](./SECURITY.md)
+
+Agents that scan this repository should consult `SECURITY.md` and the
+threat model it links before reporting issues.
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 00000000000..c8c12d6ffeb
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,29 @@
+
+
+# Security Policy
+
+## Reporting a Vulnerability
+
+`apache/jena` follows the [Apache Software Foundation security process](https://www.apache.org/security/). Please report suspected
+vulnerabilities privately to `security@apache.org`; do not open public
+GitHub issues or pull requests for security reports.
+
+## Threat Model
+
+What the project treats as in scope and out of scope, the security
+properties it provides and disclaims, the adversary model, and how
+findings are triaged are documented in [THREAT_MODEL.md](./THREAT_MODEL.md).
diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md
new file mode 100644
index 00000000000..40ec5d9200d
--- /dev/null
+++ b/THREAT_MODEL.md
@@ -0,0 +1,205 @@
+
+
+# Apache Jena — Threat Model (v0 draft)
+
+## §1 Header
+
+- **Project:** Apache Jena (`apache/jena`), `main`, against which this draft was written. A monorepo: the RDF/SPARQL Java framework (`jena-core`, `jena-arq`, `jena-base`, RIOT parsers, `jena-tdb1`/`jena-tdb2` stores, SHACL/ShEx, GeoSPARQL, text index) **and** the Fuseki HTTP server (`jena-fuseki2`).
+- **Date:** 2026-06-02. **Status:** draft — for Apache Jena PMC review. **Author:** ASF Security team (drafted via the Scovetta threat-model rubric), for PMC ratification.
+- **Version binding:** versioned with the project; a report against version *N* is triaged against the model as it stood at *N*.
+- **Reporting cross-reference:** §8-property violations → report privately per ASF process (`security@apache.org` → `private@jena.apache.org`); §3/§9 findings are closed citing this document.
+- **Provenance legend:** *(documented)* = Jena's own docs/repo; *(maintainer)* = confirmed by a Jena PMC member through this process (andy@ has ratified destination + the help-with-model request); *(inferred)* = reasoned from architecture, not yet confirmed — each has a matching §14 open question.
+- **Draft confidence:** ~12 documented / ~2 maintainer / ~34 inferred.
+- **What Jena is:** Apache Jena is a Java framework for building Semantic-Web / linked-data applications over RDF. It provides an in-process API to RDF data held in memory or in a native store (TDB), the ARQ SPARQL query/update engine, RIOT parsers/serialisers for RDF syntaxes (Turtle, RDF/XML, JSON-LD, N-Triples, …), and **Fuseki** — a standalone HTTP server exposing SPARQL query, SPARQL Update, and the Graph Store Protocol over the network. *(documented — README, jena.apache.org; maintainer — andy@ 2026-06-01: "an HTTP-based data server (Fuseki) and a Java API to RDF data stored in memory and in a custom database")*
+
+## §2 Scope and intended use
+
+- **Two deployment shapes** *(maintainer — andy@)*:
+  - **Fuseki** — a long-running **HTTP server** that answers SPARQL over the network. The primary network trust surface.
+ - **The Jena Java API** — `jena-core`/`jena-arq`/TDB embedded **in-process** in another application. Trusted caller; the bytes/queries it feeds Jena are that application's responsibility. +- **Caller roles** (Fuseki is a network service — the role splits): + - **anonymous SPARQL client** — issues SPARQL queries over HTTP. **Default-public for query** *(documented — Fuseki security docs: "SPARQL endpoints are open to the public but administrative functions are limited to localhost")*. + - **authenticated user / admin** — gated by Apache Shiro (`shiro.ini`); admin functions (`/$/*`) restricted to localhost by default *(documented)*. + - **operator/deployer** — configures Shiro, datasets, TDB location, and which endpoints are read-only vs updatable. **Trusted.** *(inferred)* + - **embedding application** (Java API) — trusted; supplies queries/RDF to the library. *(inferred)* + +**Component-family table** *(monorepo; in/out of model):* + +| Family | Entry point | Touches OS/network | In model? 
| +| --- | --- | --- | --- | +| Fuseki HTTP server | `jena-fuseki2` — SPARQL query / Update / Graph Store Protocol, admin `/$/*` | network (listens) | **In — primary boundary** *(documented)* | +| SPARQL engine (ARQ) | `jena-arq` — query/update eval, `SERVICE` federation, custom functions | network out (SERVICE), file (file: URLs) | **In — high value** *(inferred)* | +| RDF I/O (RIOT) | `jena-arq`/`jena-core` parsers (RDF/XML, Turtle, JSON-LD, …) | parses untrusted RDF | **In — XXE / parser-DoS surface** *(inferred)* | +| Stores | `jena-tdb1`, `jena-tdb2`, `jena-db` | filesystem | **In (engine's use); on-disk store is operator-trusted** *(inferred)* | +| IRI / langtag | `jena-iri3986`, `jena-langtag`, `jena-base` | none | **In (input parsing)** *(inferred)* | +| Validation / extensions | `jena-shacl`, `jena-shex`, `jena-geosparql`, `jena-text`, `jena-serviceenhancer` | text index; SERVICE | **In (reachable from queries)** *(inferred)* | +| Client/API helpers | `jena-rdfconnection`, `jena-querybuilder`, `jena-rdfpatch`, `jena-commonsrdf`, `jena-ontapi` | none | **In as libraries (memory/correctness)** *(inferred)* | +| CLI tools | `jena-cmds` | filesystem | **In iff fed untrusted input; usually operator-run** *(inferred)* | +| Examples / tests / benchmarks | `jena-examples`, `jena-integration-tests`, `jena-benchmarks` | n/a | **Out** *(see §3)* | + +## §3 Out of scope (explicit non-goals) + +- **`jena-examples`, `jena-integration-tests`, `jena-benchmarks`** — illustrative/test, not production. *(inferred)* +- **Attackers who control the host, the Fuseki config (`shiro.ini`, dataset config), the TDB data directory, or the embedding Java application.** Operator-trusted. *(inferred)* +- **The embedding application's own use of the Java API** — if an app feeds attacker-controlled SPARQL it built by string-concatenation to ARQ, that injection is the app's bug, not Jena's (analogous to SQL injection in a JDBC caller). 
*(inferred)* +- **Generic DoS / query-complexity exhaustion** beyond a to-be-confirmed line — Andy raised resource-volume as a concern; the §8 resource line + §14 frame it. *(inferred)* +- **Confidentiality of RDF data at rest / TLS on the wire** — operator deployment (reverse proxy for TLS; filesystem perms for TDB). *(inferred)* + +## §4 Trust boundaries and data flow + +- **Primary boundary: the Fuseki SPARQL endpoint.** Queries arrive over HTTP from (by default) **anonymous** clients. The boundary question is what an anonymous/low-privilege SPARQL query can reach: read data it shouldn't, **write** (SPARQL Update / GSP) without authorisation, make Fuseki issue outbound requests (`SERVICE` → SSRF), read local files (`file:` URLs / FROM), execute code (ARQ custom/JavaScript functions if enabled), or exhaust resources. *(inferred; public-query default documented)* +- **Admin boundary:** the `/$/*` admin surface is localhost-only by default *(documented)*; exposing it to the network is an operator misconfiguration. +- **RDF-parse boundary:** any endpoint that **parses** caller-supplied RDF (Update bodies, GSP PUT/POST, content negotiation) runs RIOT on untrusted bytes — the XXE (RDF/XML) and parser-DoS surface. *(inferred)* +- **Reachability preconditions:** + - A finding in ARQ/RIOT/stores is **in-model** iff reachable from a Fuseki request at the relevant role (default: anonymous query; authenticated for Update). *(inferred)* + - A finding reachable only through the **in-process Java API** with caller-supplied trusted input is `OUT-OF-MODEL: trusted-input` (the embedding app owns it). *(inferred)* + - A finding requiring operator config (`shiro.ini`, exposing admin, enabling JS functions) is `OUT-OF-MODEL: trusted-input` / `non-default-build`. *(inferred)* + +## §5 Assumptions about the environment + +- **Runtime:** JVM (Java; "old in places" per andy@). 
*(maintainer)* +- **Fuseki auth:** Apache Shiro via `$FUSEKI_BASE/shiro.ini`; changing it needs a restart *(documented — Fuseki security docs)*. +- **Store:** TDB1/TDB2 on the local filesystem, assumed private to the Fuseki/JVM process. *(inferred)* +- **Network:** TLS is the deployer's (reverse proxy); Fuseki's bundled example setup is plaintext *(documented — "no TLS, passwords in plain text")*. +- **Negative side-effects inventory** (inferred — wave-1/2 target): Fuseki listens on HTTP; ARQ can make **outbound** network requests via `SERVICE` (federation) and can read **`file:`/http: URLs** named in queries (FROM/FROM NAMED/SERVICE); RIOT parses untrusted RDF; ARQ may execute **custom/JavaScript functions** if the operator enabled them; TDB reads/writes the data directory. *(inferred — these are the load-bearing confirmations)* + +## §5a Build-time and configuration variants + +Security-relevant configuration *(Fuseki auth documented; the rest inferred — confirm defaults):* + +| Knob | Default | Effect / stance | +| --- | --- | --- | +| Fuseki Shiro auth (`shiro.ini`) | SPARQL **query** public; admin `/$/*` **localhost-only** | *(documented)* Restricting query access requires Shiro `[urls]` ACLs. | +| Fuseki example user setup | `admin`/`pw`, plaintext, no TLS | *(documented)* explicitly "not recommended for production". Any "default admin/pw in prod" report → `OUT-OF-MODEL: non-default-build`. | +| SPARQL **Update** / Graph Store write | per-dataset (read-only vs read-write service) — **default to confirm** | *(inferred)* If a dataset ships update-enabled + unauthenticated, anonymous write is in-model; if read-only by default, anonymous write is not reachable. **Wave-1 question.** | +| `SERVICE` (federated query) | **to confirm** (enabled? restrictable allow-list?) | *(inferred)* SSRF surface; whether it can be disabled / allow-listed is the key §10 lever. | +| ARQ **JavaScript / custom functions** | **to confirm** (opt-in?) 
| *(inferred)* If enabled, SPARQL can execute code → by-design-if-operator-enabled, like a trusted extension. | +| RDF/XML & external-entity handling in RIOT | **to confirm** (XXE off by default?) | *(inferred)* Whether external entities / `file:` access are disabled by default in the parsers. | +| Query timeout / result limits | **to confirm** | *(inferred)* the resource/DoS lever (Andy's concern). | + +## §6 Assumptions about inputs + +Per-surface trust table *(Fuseki defaults documented; the rest inferred):* + +| Surface | Input | Attacker-controllable? | Caller/operator must enforce | +| --- | --- | --- | --- | +| Fuseki SPARQL query endpoint | SPARQL query text | **yes (anonymous by default)** | Shiro ACLs if data is sensitive; SERVICE/file/JS-function restrictions; query timeout | +| Fuseki SPARQL Update / GSP | update text / RDF body | **yes — must be authorised** | read-only-by-default or Shiro-gated write; RDF parse hardening | +| RDF parse (RIOT) anywhere | RDF/XML, Turtle, JSON-LD, … | **yes** | external-entity (XXE) off; bounded nesting/size | +| `SERVICE ` in a query | target URL | **yes** | SSRF egress controls / allow-list | +| `FROM` / `FROM NAMED` / `file:` URI | dataset URI | **yes** | block `file:` and arbitrary fetch from untrusted queries | +| Fuseki admin `/$/*` | dataset mgmt, backups | **must not be on the public net** | localhost-only (default) / operator network | +| Java API (`QueryExecution`, `Model.read`) | query / RDF from the app | no — the embedding app's trust | app validates its own untrusted inputs | + +- **Size/shape/rate:** query-cost / result-size / parser-nesting bounds — to confirm (Andy's volume concern); §8 resource line. 
*(inferred)* + +## §7 Adversary model + +- **Anonymous SPARQL client (primary)** — can reach Fuseki's public query endpoint; goals: read non-public graphs, write via an exposed Update endpoint, SSRF via `SERVICE`, local-file read via `file:`/FROM, code execution via JS functions (if enabled), resource exhaustion via expensive queries. *(inferred; public-query default documented)* +- **Authenticated low-privilege user** — bounded by Shiro/dataset ACLs; goal: exceed them. *(inferred)* +- **Crafted-RDF attacker** — supplies malicious RDF (RDF/XML XXE, deeply-nested/oversized documents) to any parse path. *(inferred)* +- **Out of scope:** operator/host control; the embedding app supplying its own trusted input; anyone who can edit `shiro.ini` or enable JS functions. *(inferred)* + +## §8 Security properties the project provides + +*(All inferred pending PMC confirmation except where Fuseki defaults are documented.)* + +- **Admin surface is localhost-bound by default.** Fuseki's `/$/*` admin functions are not reachable from the network unless the operator exposes them. *Violation symptom:* an admin function reachable anonymously over the network in the default config. *Severity:* CVE-class. *(documented — Fuseki security docs)* +- **Shiro access control is enforced when configured.** A Shiro `[urls]` ACL restricting an endpoint cannot be bypassed by request manipulation. *Violation symptom:* a restricted endpoint reached without satisfying its Shiro rule. *Severity:* CVE-class. *(inferred)* +- **SPARQL queries cannot escape the dataset's authorised scope.** An anonymous/low-priv query cannot read graphs, write data, reach the filesystem, or make Fuseki act as an SSRF proxy beyond what the dataset config permits. *Violation symptom:* SERVICE-SSRF, `file:` read, cross-graph read, or unauthorised write from an in-scope query. *Severity:* CVE-class. 
*(inferred — the core boundary to ratify; these are the classic Jena CVE classes)* +- **RDF parsing is safe against untrusted documents.** RIOT parsing of attacker RDF does not resolve external entities (XXE), execute code, or recurse/allocate unboundedly. *Violation symptom:* XXE, SSRF, or DoS from a parsed RDF document. *Severity:* CVE-class. *(inferred)* +- **Resource bounds — UNSPECIFIED.** Whether an expensive SPARQL query (Andy's volume concern) or a large RDF body is a bug or an operator-tuned limit (query timeout) is open. *(inferred)* + +## §9 Security properties the project does *not* provide + +- **No protection if the operator exposes the admin surface, ships the example `admin`/`pw` setup, or runs without TLS** — deployment hardening (pending §5a rulings). *(documented that the example setup is not for production)* +- **No defense when ARQ JavaScript/custom functions are enabled on an untrusted endpoint** — enabling code-executing functions and exposing them to anonymous queries is operator-chosen code execution (by-design, like a trusted extension), pending confirmation. *(inferred)* **False friend:** a SPARQL endpoint being "read-only" does not by itself prevent SSRF (`SERVICE`) or local-file read (`file:`) unless those are separately restricted. +- **No SPARQL-injection defense for the embedding application** — an app that concatenates untrusted input into a query string owns that bug (use parameterised queries / `QueryBuilder`). *(inferred)* +- **No transport security / authentication unless the operator configures Shiro + TLS.** *(documented/inferred)* +- **No generic-DoS / query-complexity guarantee** beyond a to-be-stated line. *(inferred)* +- **Well-known classes left to the caller/operator:** SSRF via `SERVICE`, local-file disclosure via `file:`/FROM, XXE in RDF/XML, SPARQL injection (embedding app), and algorithmic-complexity DoS via crafted queries. 
*(inferred — Jena's published CVE history clusters here; confirm in §14)* + +## §10 Downstream responsibilities (operator/deployer) + +- **Put Fuseki behind auth (Shiro) + TLS** before exposing sensitive data; never ship the example `admin`/`pw` setup to production. *(documented)* +- **Keep the admin `/$/*` surface localhost-only / operator-network.** *(documented)* +- **Make datasets read-only unless write is intended**, and gate SPARQL Update / GSP behind Shiro. *(inferred)* +- **Restrict or disable `SERVICE` federation and `file:` access** on endpoints reachable by untrusted clients (SSRF / local-file). *(inferred)* +- **Do not enable ARQ JavaScript/custom functions on untrusted endpoints.** *(inferred)* +- **Set query timeouts / result-size limits** appropriate to capacity (the volume lever). *(inferred)* +- **Use parameterised queries** (`QueryBuilder`/parameterised `QueryExecution`) in embedding apps; never string-concatenate untrusted input into SPARQL. *(inferred)* + +## §11 Known misuse patterns + +*(Draft one-liners — expand before publishing.)* + +- Exposing a public, update-enabled SPARQL endpoint with no auth. *(inferred)* +- Leaving `SERVICE`/`file:` reachable from anonymous queries (SSRF / file read). *(inferred)* +- Enabling ARQ JS functions on a public endpoint. *(inferred)* +- Shipping the example `admin`/`pw` / no-TLS Fuseki setup to production. *(documented as not-for-prod)* +- Building SPARQL by concatenating untrusted strings in an embedding app. *(inferred)* +- Parsing untrusted RDF/XML without external-entity protections. *(inferred)* + +## §11a Known non-findings (recurring false positives) + +*(Seed list — confirmations are the highest-leverage scan-suppression input.)* + +- "Fuseki SPARQL endpoint is open without auth" — public **query** is the documented default; restricting it is the operator's Shiro config. A report is `VALID` only if a *configured* restriction is bypassed or an *update/admin* surface is anonymously reachable. 
*(documented default)* +- "Default `admin`/`pw`, no TLS" — the example setup, documented as not-for-production → `OUT-OF-MODEL: non-default-build`. *(documented)* +- "SPARQL query consumes lots of CPU/memory" — pending the §8 resource line; likely operator-tuned (query timeout) unless super-linear on a small query. *(inferred)* +- "ARQ can call JavaScript / custom functions" — only if the operator enabled them; on a trusted/admin endpoint that's by-design. `OUT-OF-MODEL: trusted-input` / `non-default-build` unless reachable anonymously. *(inferred — confirm the default)* +- "Embedding app built an injectable SPARQL string" — the app's bug, not Jena's. `OUT-OF-MODEL: trusted-input`. *(inferred)* + +## §12 Conditions that would change this model + +- A change to Fuseki's default auth posture (public-query / localhost-admin), the example-setup defaults, or the SPARQL-Update default. *(documented knobs)* +- A change to `SERVICE`/`file:`/JS-function defaults or their restrictability. *(inferred)* +- A new network surface or a new parser. *(inferred)* +- A report that cannot be routed to one §13 disposition → revise the model. + +## §13 Triage dispositions + +| Disposition | Meaning | Licensed by | +| --- | --- | --- | +| `VALID` | Violates a §8 property via an in-scope adversary/input (config-bypass, anonymous write/admin, SSRF/file-read/XXE/code-exec from an in-scope query under default config). | §8, §6, §7 | +| `VALID-HARDENING` | No §8 property broken, but a §11 misuse is easy enough to harden (safer defaults, SERVICE allow-list, parser limits). | §11 | +| `OUT-OF-MODEL: trusted-input` | Requires operator config (shiro.ini, enabling JS functions, exposing admin) or the embedding app's own untrusted input. | §6, §7 | +| `OUT-OF-MODEL: adversary-not-in-scope` | Requires host/JVM/config control. | §7 | +| `OUT-OF-MODEL: unsupported-component` | Lands in `jena-examples` / tests / benchmarks. 
| §3 | +| `OUT-OF-MODEL: non-default-build` | Only manifests under a discouraged/non-default §5a setting (example creds, JS functions on, admin exposed). | §5a | +| `BY-DESIGN: property-disclaimed` | Concerns a §9-disclaimed property (operator-enabled code exec, no-TLS-by-default, embedding-app SPARQL injection). | §9 | +| `KNOWN-NON-FINDING` | Matches a §11a entry. | §11a | +| `MODEL-GAP` | Cannot be routed — triggers §12. | §12 | + +## §14 Open questions for the maintainers + +**Wave 1 — scope & Fuseki defaults:** +1. Confirm scope is the `apache/jena` monorepo with **Fuseki + ARQ + RIOT + TDB** as the in-model core, and `jena-examples`/tests/benchmarks out. → §2/§3. +2. **SPARQL Update / Graph Store write default:** does a Fuseki dataset ship **read-only** by default, or can it be update-enabled-and-unauthenticated? (Decides whether anonymous write is in-model or a misconfig.) → §5a/§8. +3. Confirm the documented default — public query, localhost-only admin, example `admin`/`pw` is not-for-production (`non-default-build`). → §5a/§11a. + +**Wave 2 — the high-value query surfaces (the Jena CVE classes):** +4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10. +5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9. +6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a. +7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8. + +**Wave 3 — resources, API, meta:** +8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a. +9. 
Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9. +10. Any other recurring scanner/fuzzer false positives to seed §11a? → §11a. +11. **Meta:** Jena has no in-repo `SECURITY.md`/`AGENTS.md`; this engagement adds them + `THREAT_MODEL.md`, wiring `AGENTS.md → SECURITY.md → THREAT_MODEL.md`. The Fuseki security docs live on the website. Confirm the in-repo model is canonical and references the website docs; confirm revision ownership. → §1. From 530406b6a22c30602dd84152253f207a7cf4fb2e Mon Sep 17 00:00:00 2001 From: Rob Vesse Date: Thu, 11 Jun 2026 09:34:26 +0100 Subject: [PATCH 2/6] Apply suggestions from code review Co-authored-by: Rob Vesse Co-authored-by: Andy Seaborne --- THREAT_MODEL.md | 35 ++++++++++++++++++----------------- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md index 40ec5d9200d..d6d23987d18 100644 --- a/THREAT_MODEL.md +++ b/THREAT_MODEL.md @@ -29,10 +29,10 @@ limitations under the License. ## §2 Scope and intended use - **Two deployment shapes** *(maintainer — andy@)*: - - **Fuseki** — a long-running **HTTP server** that answers SPARQL over the network. The primary network trust surface. + - **Fuseki** — a long-running **HTTP server** that answers SPARQL Query and SPARQL Update as well as the SPARQL Graph Store Protocol (read and read-write forms) over the network. The primary network trust surface. - **The Jena Java API** — `jena-core`/`jena-arq`/TDB embedded **in-process** in another application. Trusted caller; the bytes/queries it feeds Jena are that application's responsibility. - **Caller roles** (Fuseki is a network service — the role splits): - - **anonymous SPARQL client** — issues SPARQL queries over HTTP. 
**Default-public for query** *(documented — Fuseki security docs: "SPARQL endpoints are open to the public but administrative functions are limited to localhost")*. + - **anonymous SPARQL client** — issues SPARQL queries over HTTP. **Default-public for SPARQL query** *(documented — Fuseki security docs: "SPARQL endpoints are open to the public but administrative functions are limited to localhost")*. - **authenticated user / admin** — gated by Apache Shiro (`shiro.ini`); admin functions (`/$/*`) restricted to localhost by default *(documented)*. - **operator/deployer** — configures Shiro, datasets, TDB location, and which endpoints are read-only vs updatable. **Trusted.** *(inferred)* - **embedding application** (Java API) — trusted; supplies queries/RDF to the library. *(inferred)* @@ -44,9 +44,9 @@ limitations under the License. | Fuseki HTTP server | `jena-fuseki2` — SPARQL query / Update / Graph Store Protocol, admin `/$/*` | network (listens) | **In — primary boundary** *(documented)* | | SPARQL engine (ARQ) | `jena-arq` — query/update eval, `SERVICE` federation, custom functions | network out (SERVICE), file (file: URLs) | **In — high value** *(inferred)* | | RDF I/O (RIOT) | `jena-arq`/`jena-core` parsers (RDF/XML, Turtle, JSON-LD, …) | parses untrusted RDF | **In — XXE / parser-DoS surface** *(inferred)* | -| Stores | `jena-tdb1`, `jena-tdb2`, `jena-db` | filesystem | **In (engine's use); on-disk store is operator-trusted** *(inferred)* | +| Stores | `jena-tdb1`, `jena-tdb2`, `jena-text` | filesystem | **In (engine's use); on-disk store is operator-trusted** *(inferred)* | | IRI / langtag | `jena-iri3986`, `jena-langtag`, `jena-base` | none | **In (input parsing)** *(inferred)* | -| Validation / extensions | `jena-shacl`, `jena-shex`, `jena-geosparql`, `jena-text`, `jena-serviceenhancer` | text index; SERVICE | **In (reachable from queries)** *(inferred)* | +| Validation / extensions | `jena-shacl`, `jena-shex`, `jena-geosparql`, `jena-serviceenhancer` | 
SERVICE | **In (reachable from queries)** *(inferred)* | | Client/API helpers | `jena-rdfconnection`, `jena-querybuilder`, `jena-rdfpatch`, `jena-commonsrdf`, `jena-ontapi` | none | **In as libraries (memory/correctness)** *(inferred)* | | CLI tools | `jena-cmds` | filesystem | **In iff fed untrusted input; usually operator-run** *(inferred)* | | Examples / tests / benchmarks | `jena-examples`, `jena-integration-tests`, `jena-benchmarks` | n/a | **Out** *(see §3)* | @@ -62,8 +62,8 @@ limitations under the License. ## §4 Trust boundaries and data flow - **Primary boundary: the Fuseki SPARQL endpoint.** Queries arrive over HTTP from (by default) **anonymous** clients. The boundary question is what an anonymous/low-privilege SPARQL query can reach: read data it shouldn't, **write** (SPARQL Update / GSP) without authorisation, make Fuseki issue outbound requests (`SERVICE` → SSRF), read local files (`file:` URLs / FROM), execute code (ARQ custom/JavaScript functions if enabled), or exhaust resources. *(inferred; public-query default documented)* -- **Admin boundary:** the `/$/*` admin surface is localhost-only by default *(documented)*; exposing it to the network is an operator misconfiguration. -- **RDF-parse boundary:** any endpoint that **parses** caller-supplied RDF (Update bodies, GSP PUT/POST, content negotiation) runs RIOT on untrusted bytes — the XXE (RDF/XML) and parser-DoS surface. *(inferred)* +- **Admin boundary:** the `/$/*` admin surface is localhost-only by default *(documented)*; exposing it to the network (without configuring authentication/authorisation) is an operator misconfiguration. +- **RDF-parse boundary:** any endpoint that **parses** caller-supplied RDF (Update bodies, GSP PUT/POST, content negotiation) runs RIOT on untrusted bytes — the XXE (RDF/XML), JSON-LD Context, and parser-DoS surface. 
*(inferred)* - **Reachability preconditions:** - A finding in ARQ/RIOT/stores is **in-model** iff reachable from a Fuseki request at the relevant role (default: anonymous query; authenticated for Update). *(inferred)* - A finding reachable only through the **in-process Java API** with caller-supplied trusted input is `OUT-OF-MODEL: trusted-input` (the embedding app owns it). *(inferred)* @@ -73,9 +73,9 @@ limitations under the License. - **Runtime:** JVM (Java; "old in places" per andy@). *(maintainer)* - **Fuseki auth:** Apache Shiro via `$FUSEKI_BASE/shiro.ini`; changing it needs a restart *(documented — Fuseki security docs)*. -- **Store:** TDB1/TDB2 on the local filesystem, assumed private to the Fuseki/JVM process. *(inferred)* +- **Store:** TDB1/TDB2 on the local filesystem, private to the owning Fuseki/JVM process, multiple processes accessing a single store location prevented by code. *(maintainer)* - **Network:** TLS is the deployer's (reverse proxy); Fuseki's bundled example setup is plaintext *(documented — "no TLS, passwords in plain text")*. -- **Negative side-effects inventory** (inferred — wave-1/2 target): Fuseki listens on HTTP; ARQ can make **outbound** network requests via `SERVICE` (federation) and can read **`file:`/http: URLs** named in queries (FROM/FROM NAMED/SERVICE); RIOT parses untrusted RDF; ARQ may execute **custom/JavaScript functions** if the operator enabled them; TDB reads/writes the data directory. *(inferred — these are the load-bearing confirmations)* +- **Negative side-effects inventory** (inferred — wave-1/2 target): Fuseki listens on HTTP; ARQ can make **outbound** network requests via `SERVICE` (federation), `SERVICE` can be disabled by operator in configuration; ARQ can read **`file:`/http: URLs** named in queries (FROM/FROM NAMED/SERVICE); RIOT parses untrusted RDF; ARQ may execute **custom/JavaScript functions** if the operator enabled them; TDB reads/writes the data directory. 
*(inferred — these are the load-bearing confirmations)* ## §5a Build-time and configuration variants @@ -86,10 +86,11 @@ Security-relevant configuration *(Fuseki auth documented; the rest inferred — | Fuseki Shiro auth (`shiro.ini`) | SPARQL **query** public; admin `/$/*` **localhost-only** | *(documented)* Restricting query access requires Shiro `[urls]` ACLs. | | Fuseki example user setup | `admin`/`pw`, plaintext, no TLS | *(documented)* explicitly "not recommended for production". Any "default admin/pw in prod" report → `OUT-OF-MODEL: non-default-build`. | | SPARQL **Update** / Graph Store write | per-dataset (read-only vs read-write service) — **default to confirm** | *(inferred)* If a dataset ships update-enabled + unauthenticated, anonymous write is in-model; if read-only by default, anonymous write is not reachable. **Wave-1 question.** | -| `SERVICE` (federated query) | **to confirm** (enabled? restrictable allow-list?) | *(inferred)* SSRF surface; whether it can be disabled / allow-listed is the key §10 lever. | -| ARQ **JavaScript / custom functions** | **to confirm** (opt-in?) | *(inferred)* If enabled, SPARQL can execute code → by-design-if-operator-enabled, like a trusted extension. | -| RDF/XML & external-entity handling in RIOT | **to confirm** (XXE off by default?) | *(inferred)* Whether external entities / `file:` access are disabled by default in the parsers. | -| Query timeout / result limits | **to confirm** | *(inferred)* the resource/DoS lever (Andy's concern). | +| `SERVICE` (federated query) | may be disabled by operator config **(documented)** | *(inferred)* SSRF surface +| ARQ **JavaScript / custom functions** | opt-in feature, requires explicit operator config of both Fuseki and JVM | *(inferred)* If enabled, SPARQL can execute code, executable JS functions controlled by explicit white list *(documented)*, some JS functions, e.g. 
`eval()`, are explicitly blacklisted regardless of whitelist → by-design-if-operator-enabled, like a trusted extension. Java custom functions require explicit operator configuration of class path, if added to class path operator responsibility to verify function code is safe | +| RDF/XML & external-entity handling in RIOT | XXE off | *(inferred)* Whether external entities / `file:` access are disabled by default in the parsers. | +| JSON_LD & external context handling in RIOT | On | Accessed by http/https or local file. | +| Query timeout / result limits | query timeout configurable at server or per-dataset level *(documented)* | *(inferred)* the resource/DoS lever (Andy's concern). | ## §6 Assumptions about inputs @@ -100,7 +101,7 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* | Fuseki SPARQL query endpoint | SPARQL query text | **yes (anonymous by default)** | Shiro ACLs if data is sensitive; SERVICE/file/JS-function restrictions; query timeout | | Fuseki SPARQL Update / GSP | update text / RDF body | **yes — must be authorised** | read-only-by-default or Shiro-gated write; RDF parse hardening | | RDF parse (RIOT) anywhere | RDF/XML, Turtle, JSON-LD, … | **yes** | external-entity (XXE) off; bounded nesting/size | -| `SERVICE ` in a query | target URL | **yes** | SSRF egress controls / allow-list | +| `SERVICE ` in a query | target URL | **yes** | Disable if not desired; SSRF egress controls / allow-list if enabled | | `FROM` / `FROM NAMED` / `file:` URI | dataset URI | **yes** | block `file:` and arbitrary fetch from untrusted queries | | Fuseki admin `/$/*` | dataset mgmt, backups | **must not be on the public net** | localhost-only (default) / operator network | | Java API (`QueryExecution`, `Model.read`) | query / RDF from the app | no — the embedding app's trust | app validates its own untrusted inputs | @@ -120,9 +121,9 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* - **Admin surface is 
localhost-bound by default.** Fuseki's `/$/*` admin functions are not reachable from the network unless the operator exposes them. *Violation symptom:* an admin function reachable anonymously over the network in the default config. *Severity:* CVE-class. *(documented — Fuseki security docs)* - **Shiro access control is enforced when configured.** A Shiro `[urls]` ACL restricting an endpoint cannot be bypassed by request manipulation. *Violation symptom:* a restricted endpoint reached without satisfying its Shiro rule. *Severity:* CVE-class. *(inferred)* -- **SPARQL queries cannot escape the dataset's authorised scope.** An anonymous/low-priv query cannot read graphs, write data, reach the filesystem, or make Fuseki act as an SSRF proxy beyond what the dataset config permits. *Violation symptom:* SERVICE-SSRF, `file:` read, cross-graph read, or unauthorised write from an in-scope query. *Severity:* CVE-class. *(inferred — the core boundary to ratify; these are the classic Jena CVE classes)* +- **SPARQL operations cannot escape the dataset's authorised scope.** An anonymous/low-priv query cannot read graphs, write data, reach the filesystem, or make Fuseki act as an SSRF proxy beyond what the dataset config permits. *Violation symptom:* SERVICE-SSRF, `file:` read, cross-graph read, or unauthorised write from an in-scope query. *Severity:* CVE-class. *(inferred — the core boundary to ratify; these are the classic Jena CVE classes)* - **RDF parsing is safe against untrusted documents.** RIOT parsing of attacker RDF does not resolve external entities (XXE), execute code, or recurse/allocate unboundedly. *Violation symptom:* XXE, SSRF, or DoS from a parsed RDF document. *Severity:* CVE-class. *(inferred)* -- **Resource bounds — UNSPECIFIED.** Whether an expensive SPARQL query (Andy's volume concern) or a large RDF body is a bug or an operator-tuned limit (query timeout) is open. 
*(inferred)* +- **Resource bounds.** Expensive SPARQL queries and large RDF bodies are bounded by operator-tuned limits (query timeout). *(maintainer)* ## §9 Security properties the project does *not* provide @@ -130,7 +131,7 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* - **No defense when ARQ JavaScript/custom functions are enabled on an untrusted endpoint** — enabling code-executing functions and exposing them to anonymous queries is operator-chosen code execution (by-design, like a trusted extension), pending confirmation. *(inferred)* **False friend:** a SPARQL endpoint being "read-only" does not by itself prevent SSRF (`SERVICE`) or local-file read (`file:`) unless those are separately restricted. - **No SPARQL-injection defense for the embedding application** — an app that concatenates untrusted input into a query string owns that bug (use parameterised queries / `QueryBuilder`). *(inferred)* - **No transport security / authentication unless the operator configures Shiro + TLS.** *(documented/inferred)* -- **No generic-DoS / query-complexity guarantee** beyond a to-be-stated line. *(inferred)* +- **No generic-DoS / query-complexity guarantee** beyond query time limits. *(inferred)* - **Well-known classes left to the caller/operator:** SSRF via `SERVICE`, local-file disclosure via `file:`/FROM, XXE in RDF/XML, SPARQL injection (embedding app), and algorithmic-complexity DoS via crafted queries. *(inferred — Jena's published CVE history clusters here; confirm in §14)* ## §10 Downstream responsibilities (operator/deployer) @@ -160,7 +161,7 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* - "Fuseki SPARQL endpoint is open without auth" — public **query** is the documented default; restricting it is the operator's Shiro config. A report is `VALID` only if a *configured* restriction is bypassed or an *update/admin* surface is anonymously reachable.
*(documented default)* - "Default `admin`/`pw`, no TLS" — the example setup, documented as not-for-production → `OUT-OF-MODEL: non-default-build`. *(documented)* -- "SPARQL query consumes lots of CPU/memory" — pending the §8 resource line; likely operator-tuned (query timeout) unless super-linear on a small query. *(inferred)* +- "SPARQL query consumes lots of CPU/memory" — pending the §8 resource line; likely operator-tuned (query timeout). *(inferred)* - "ARQ can call JavaScript / custom functions" — only if the operator enabled them; on a trusted/admin endpoint that's by-design. `OUT-OF-MODEL: trusted-input` / `non-default-build` unless reachable anonymously. *(inferred — confirm the default)* - "Embedding app built an injectable SPARQL string" — the app's bug, not Jena's. `OUT-OF-MODEL: trusted-input`. *(inferred)* From 77da757c13b83e963d8697dff693b07020506342 Mon Sep 17 00:00:00 2001 From: Jarek Potiuk Date: Sun, 14 Jun 2026 03:38:42 +0200 Subject: [PATCH 3/6] Fold Jena PMC review into the threat model (v1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rob Vesse (@rvesse) and Andy Seaborne (@afs) reviewed #3966, folded their inline suggestions in, and answered all §14 questions. This promotes the confirmed claims to (maintainer) and rewrites §14 as the resolved record, plus the substantive additions/corrections: - SERVICE federation: enabled by default, no allow-list yet; SSRF from an anonymous query is conceded VALID (not merely disclaimed). - FROM/FROM NAMED: with TDB2 these are in-dataset graph names, NOT fetched; only an arbitrary-URI-configured dataset reads file:/remote. - JSON-LD @context remote-file fetch added as a distinct remote-read/SSRF surface; RDF/XML XXE off by default (afs). - jena-text (Lucene) text index in scope, reachable via text:query. - JS/custom functions = by-design-operator-enabled (allow-list; eval blocked). - Resource/DoS operator-tuned; super-linear framing dropped (inherent to SPARQL). 
- §11a: added the TDB-FAQ recurring non-findings (memory growth, on-disk size). Generated-by: Claude Opus 4.8 (1M context) --- THREAT_MODEL.md | 61 ++++++++++++++++++++++++++----------------------- 1 file changed, 33 insertions(+), 28 deletions(-) diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md index d6d23987d18..5c268ea3cb5 100644 --- a/THREAT_MODEL.md +++ b/THREAT_MODEL.md @@ -19,11 +19,11 @@ limitations under the License. ## §1 Header - **Project:** Apache Jena (`apache/jena`), `main`, against which this draft was written. A monorepo: the RDF/SPARQL Java framework (`jena-core`, `jena-arq`, `jena-base`, RIOT parsers, `jena-tdb1`/`jena-tdb2` stores, SHACL/ShEx, GeoSPARQL, text index) **and** the Fuseki HTTP server (`jena-fuseki2`). -- **Date:** 2026-06-02. **Status:** draft — for Apache Jena PMC review. **Author:** ASF Security team (drafted via the Scovetta threat-model rubric), for PMC ratification. +- **Date:** 2026-06-02 (v0); 2026-06 (v1, PMC-reviewed). **Status:** v1 — reviewed by the Jena PMC (Rob Vesse + Andy Seaborne) on apache/jena#3966; their inline suggestions are folded in and their answers to the §14 questions promote the load-bearing claims to *(maintainer)* (see §14). **Author:** ASF Security team (drafted via the Scovetta rubric); now PMC-reviewed. - **Version binding:** versioned with the project; a report against version *N* is triaged against the model as it stood at *N*. - **Reporting cross-reference:** §8-property violations → report privately per ASF process (`security@apache.org` → `private@jena.apache.org`); §3/§9 findings are closed citing this document. - **Provenance legend:** *(documented)* = Jena's own docs/repo; *(maintainer)* = confirmed by a Jena PMC member through this process (andy@ has ratified destination + the help-with-model request); *(inferred)* = reasoned from architecture, not yet confirmed — each has a matching §14 open question. -- **Draft confidence:** ~12 documented / ~2 maintainer / ~34 inferred. 
+- **Confidence:** v1, PMC-reviewed — rvesse + afs answered every §14 question and folded their inline edits in; the high-value query-surface claims (SERVICE/SSRF, file/URI, JS functions, XXE, JSON-LD remote-context, the resource line) are now *(maintainer)*. - **What Jena is:** Apache Jena is a Java framework for building Semantic-Web / linked-data applications over RDF. It provides an in-process API to RDF data held in memory or in a native store (TDB), the ARQ SPARQL query/update engine, RIOT parsers/serialisers for RDF syntaxes (Turtle, RDF/XML, JSON-LD, N-Triples, …), and **Fuseki** — a standalone HTTP server exposing SPARQL query, SPARQL Update, and the Graph Store Protocol over the network. *(documented — README, jena.apache.org; maintainer — andy@ 2026-06-01: "an HTTP-based data server (Fuseki) and a Java API to RDF data stored in memory and in a custom database")* ## §2 Scope and intended use @@ -44,7 +44,7 @@ limitations under the License. | Fuseki HTTP server | `jena-fuseki2` — SPARQL query / Update / Graph Store Protocol, admin `/$/*` | network (listens) | **In — primary boundary** *(documented)* | | SPARQL engine (ARQ) | `jena-arq` — query/update eval, `SERVICE` federation, custom functions | network out (SERVICE), file (file: URLs) | **In — high value** *(inferred)* | | RDF I/O (RIOT) | `jena-arq`/`jena-core` parsers (RDF/XML, Turtle, JSON-LD, …) | parses untrusted RDF | **In — XXE / parser-DoS surface** *(inferred)* | -| Stores | `jena-tdb1`, `jena-tdb2`, `jena-text` | filesystem | **In (engine's use); on-disk store is operator-trusted** *(inferred)* | +| Stores + text index | `jena-tdb1`, `jena-tdb2`; **`jena-text` (Lucene)** | filesystem | **In.** On-disk store is operator-trusted and private to the owning process *(maintainer)*; the Lucene text index is reachable from SPARQL via `text:query` — an in-model query surface *(maintainer — afs flagged jena-text)* | | IRI / langtag | `jena-iri3986`, `jena-langtag`, `jena-base` | none | **In (input 
parsing)** *(inferred)* | | Validation / extensions | `jena-shacl`, `jena-shex`, `jena-geosparql`, `jena-serviceenhancer` | SERVICE | **In (reachable from queries)** *(inferred)* | | Client/API helpers | `jena-rdfconnection`, `jena-querybuilder`, `jena-rdfpatch`, `jena-commonsrdf`, `jena-ontapi` | none | **In as libraries (memory/correctness)** *(inferred)* | @@ -63,7 +63,7 @@ limitations under the License. - **Primary boundary: the Fuseki SPARQL endpoint.** Queries arrive over HTTP from (by default) **anonymous** clients. The boundary question is what an anonymous/low-privilege SPARQL query can reach: read data it shouldn't, **write** (SPARQL Update / GSP) without authorisation, make Fuseki issue outbound requests (`SERVICE` → SSRF), read local files (`file:` URLs / FROM), execute code (ARQ custom/JavaScript functions if enabled), or exhaust resources. *(inferred; public-query default documented)* - **Admin boundary:** the `/$/*` admin surface is localhost-only by default *(documented)*; exposing it to the network (without configuring authentication/authorisation) is an operator misconfiguration. -- **RDF-parse boundary:** any endpoint that **parses** caller-supplied RDF (Update bodies, GSP PUT/POST, content negotiation) runs RIOT on untrusted bytes — the XXE (RDF/XML), JSON-LD Context, and parser-DoS surface. *(inferred)* +- **RDF-parse boundary:** any endpoint that **parses** caller-supplied RDF (Update bodies, GSP PUT/POST, content negotiation) runs RIOT on untrusted bytes. RDF/XML external-entity (XXE) processing is **off by default** *(maintainer — afs)*. **JSON-LD `@context` resolution, however, fetches remote files by default** — a remote-read/SSRF surface inherited from the JSON-LD dependency (W3C JSON-LD WG mitigation in progress); safer than XXE but real *(maintainer — afs)*. Plus the general parser-DoS surface *(inferred)*. 
- **Reachability preconditions:** - A finding in ARQ/RIOT/stores is **in-model** iff reachable from a Fuseki request at the relevant role (default: anonymous query; authenticated for Update). *(inferred)* - A finding reachable only through the **in-process Java API** with caller-supplied trusted input is `OUT-OF-MODEL: trusted-input` (the embedding app owns it). *(inferred)* @@ -100,9 +100,10 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* | --- | --- | --- | --- | | Fuseki SPARQL query endpoint | SPARQL query text | **yes (anonymous by default)** | Shiro ACLs if data is sensitive; SERVICE/file/JS-function restrictions; query timeout | | Fuseki SPARQL Update / GSP | update text / RDF body | **yes — must be authorised** | read-only-by-default or Shiro-gated write; RDF parse hardening | -| RDF parse (RIOT) anywhere | RDF/XML, Turtle, JSON-LD, … | **yes** | external-entity (XXE) off; bounded nesting/size | -| `SERVICE ` in a query | target URL | **yes** | Disable if not desired; SSRF egress controls / allow-list if enabled | -| `FROM` / `FROM NAMED` / `file:` URI | dataset URI | **yes** | block `file:` and arbitrary fetch from untrusted queries | +| RDF parse (RIOT) anywhere | RDF/XML, Turtle, JSON-LD, … | **yes** | RDF/XML XXE off by default *(maintainer)*; bounded nesting/size | +| JSON-LD `@context` resolution (RIOT) | `@context` URL | **yes** | remote-context fetch is **on by default** (SSRF / remote-read) — restrict on untrusted-input endpoints *(maintainer — afs)* | +| `SERVICE ` (federation) | target URL | **yes — enabled by default; SSRF is conceded `VALID`** *(maintainer)* | no allow-list yet — disable `SERVICE` or add egress controls on untrusted endpoints | +| `FROM` / `FROM NAMED` URI | dataset URI | **dataset-impl-dependent** *(maintainer)* | with TDB2 these are in-dataset graph names, **not fetched**; only an arbitrary-URI-configured dataset reads `file:`/remote | | Fuseki admin `/$/*` | dataset mgmt, backups | **must not be 
on the public net** | localhost-only (default) / operator network | | Java API (`QueryExecution`, `Model.read`) | query / RDF from the app | no — the embedding app's trust | app validates its own untrusted inputs | @@ -128,11 +129,11 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* ## §9 Security properties the project does *not* provide - **No protection if the operator exposes the admin surface, ships the example `admin`/`pw` setup, or runs without TLS** — deployment hardening (pending §5a rulings). *(documented that the example setup is not for production)* -- **No defense when ARQ JavaScript/custom functions are enabled on an untrusted endpoint** — enabling code-executing functions and exposing them to anonymous queries is operator-chosen code execution (by-design, like a trusted extension), pending confirmation. *(inferred)* **False friend:** a SPARQL endpoint being "read-only" does not by itself prevent SSRF (`SERVICE`) or local-file read (`file:`) unless those are separately restricted. +- **No defense when ARQ JavaScript/custom functions are enabled on an untrusted endpoint** — JS functions are opt-in with an explicit allow-list (`eval()` etc. blacklisted regardless); custom Java functions require the operator to add trusted code to the class path. Reachable code execution is **`by-design-operator-enabled`** *(maintainer — rvesse)*. **False friend:** a SPARQL endpoint being "read-only" does not by itself prevent SSRF (`SERVICE`) unless `SERVICE` is separately restricted. - **No SPARQL-injection defense for the embedding application** — an app that concatenates untrusted input into a query string owns that bug (use parameterised queries / `QueryBuilder`). *(inferred)* - **No transport security / authentication unless the operator configures Shiro + TLS.** *(documented/inferred)* -- **No generic-DoS / query-complexity guarantee** beyond query time limits. 
*(inferred)* -- **Well-known classes left to the caller/operator:** SSRF via `SERVICE`, local-file disclosure via `file:`/FROM, XXE in RDF/XML, SPARQL injection (embedding app), and algorithmic-complexity DoS via crafted queries. *(inferred — Jena's published CVE history clusters here; confirm in §14)* +- **No generic-DoS / query-complexity guarantee** beyond operator-set query timeouts + reverse-proxy size limits; a trivial query can compute a huge cross-product, which is inherent to spec-compliant SPARQL, not a Jena bug *(maintainer — rvesse)*. +- **Well-known classes — note the split:** **SSRF via `SERVICE`** from an anonymous query against the default config is a **conceded `VALID` attack vector** *(maintainer — rvesse)*, not merely operator-disclaimed (there is no allow-list yet; operators must disable `SERVICE` or add egress controls). Left to the caller/operator: SPARQL injection (embedding app), algorithmic-complexity DoS (operator timeouts), and — only for a dataset explicitly configured for arbitrary-URI access — `file:`/remote read. With a TDB2 store, `FROM`/`FROM NAMED` access only in-dataset graphs and do **not** fetch URIs *(maintainer — rvesse)*. JSON-LD `@context` remote fetch is a separate remote-read surface (§4/§6). ## §10 Downstream responsibilities (operator/deployer) @@ -161,9 +162,11 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* - "Fuseki SPARQL endpoint is open without auth" — public **query** is the documented default; restricting it is the operator's Shiro config. A report is `VALID` only if a *configured* restriction is bypassed or an *update/admin* surface is anonymously reachable. *(documented default)* - "Default `admin`/`pw`, no TLS" — the example setup, documented as not-for-production → `OUT-OF-MODEL: non-default-build`. *(documented)* -- "SPARQL query consumes lots of CPU/memory" — pending the §8 resource line; likely operator-tuned (query timeout). 
*(inferred)* -- "ARQ can call JavaScript / custom functions" — only if the operator enabled them; on a trusted/admin endpoint that's by-design. `OUT-OF-MODEL: trusted-input` / `non-default-build` unless reachable anonymously. *(inferred — confirm the default)* -- "Embedding app built an injectable SPARQL string" — the app's bug, not Jena's. `OUT-OF-MODEL: trusted-input`. *(inferred)* +- "SPARQL query consumes lots of CPU/memory" — operator-tuned via query timeout + reverse-proxy size limits; a tiny query computing a huge cross-product is inherent to spec-compliant SPARQL (affects every engine), not a Jena bug *(maintainer — rvesse)*. +- "ARQ can call JavaScript / custom functions" — only if the operator enabled them, with a JS allow-list / operator-supplied Java code; `by-design-operator-enabled`. `OUT-OF-MODEL: trusted-input` / `non-default-build` unless reachable anonymously *(maintainer — rvesse)*. +- "Embedding app built an injectable SPARQL string" — the app's bug, not Jena's; parameterised queries are the recommended pattern. `OUT-OF-MODEL: trusted-input` *(maintainer — rvesse)*. +- "Fuseki/TDB has a memory leak" — unbounded memory growth under continuous read/write load is a known issue; the WAL guarantees no data loss on crash/restart. A recurring mailing-list topic, not a vulnerability *(maintainer — rvesse; TDB FAQ)*. +- "TDB database is far larger on disk than the input" — sparse files (metrics vary by tool/filesystem) plus TDB2's MVCC trees orphaning old blocks per write; expected. The `compaction` operation reclaims space (run periodically) *(maintainer — rvesse; TDB FAQ)*. ## §12 Conditions that would change this model @@ -186,21 +189,23 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* | `KNOWN-NON-FINDING` | Matches a §11a entry. | §11a | | `MODEL-GAP` | Cannot be routed — triggers §12. 
| §12 | -## §14 Open questions for the maintainers +## §14 Open questions — RESOLVED by the Jena PMC (2026-06) -**Wave 1 — scope & Fuseki defaults:** -1. Confirm scope is the `apache/jena` monorepo with **Fuseki + ARQ + RIOT + TDB** as the in-model core, and `jena-examples`/tests/benchmarks out. → §2/§3. -2. **SPARQL Update / Graph Store write default:** does a Fuseki dataset ship **read-only** by default, or can it be update-enabled-and-unauthenticated? (Decides whether anonymous write is in-model or a misconfig.) → §5a/§8. -3. Confirm the documented default — public query, localhost-only admin, example `admin`/`pw` is not-for-production (`non-default-build`). → §5a/§11a. +Reviewed on [apache/jena#3966](https://github.com/apache/jena/pull/3966) by **Rob Vesse (`@rvesse`)** and **Andy Seaborne (`@afs`)**, who folded their inline suggestions into the model and answered the open questions. Confirmed claims are promoted to *(maintainer)*; the answers below are the durable record. -**Wave 2 — the high-value query surfaces (the Jena CVE classes):** -4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10. -5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9. -6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a. -7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8. +**Wave 1 — scope & Fuseki defaults** +1. **Scope** *(maintainer)*: the `apache/jena` monorepo with Fuseki + ARQ + RIOT + TDB as the in-model core, `jena-examples`/tests/benchmarks out. afs added that the **Lucene-based text index (`jena-text`)**, reachable from SPARQL via `text:query`, is in scope (§2/§6). +2. 
**Update default** *(maintainer — rvesse)*: deployment-dependent. Fuseki started **without a config file** is **read-only unless `--update`** is passed. With a config file, only the operations the config declares are available (Update must be explicitly configured) — though the documentation's example configs do include update services. +3. **Defaults** *(maintainer)*: confirmed — public SPARQL query, localhost-only admin, example `admin`/`pw` + no-TLS explicitly not-for-production (`non-default-build`). -**Wave 3 — resources, API, meta:** -8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a. -9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9. -10. Any other recurring scanner/fuzzer false positives to seed §11a? → §11a. -11. **Meta:** Jena has no in-repo `SECURITY.md`/`AGENTS.md`; this engagement adds them + `THREAT_MODEL.md`, wiring `AGENTS.md → SECURITY.md → THREAT_MODEL.md`. The Fuseki security docs live on the website. Confirm the in-repo model is canonical and references the website docs; confirm revision ownership. → §1. +**Wave 2 — high-value query surfaces (the Jena CVE classes)** +4. **`SERVICE` federation (SSRF)** *(maintainer — rvesse)*: **enabled by default**, **disableable** in config, **no allow-list capability currently** (a noted hardening gap; the Service Enhancer module may help). The PMC **concedes SSRF via `SERVICE` is a valid attack vector** and that the docs should call it out more explicitly → an SSRF from an anonymous query against a default config is **`VALID`**, not merely disclaimed. +5. 
**`file:` / arbitrary-URI read** via `FROM`/`FROM NAMED`/`SERVICE` *(maintainer — rvesse)*: **depends on the dataset implementation.** With a persistent store like **TDB2, `FROM`/`FROM NAMED` only resolve graphs *within the dataset* — the URIs are treated as graph names and are *not* fetched.** Fuseki *can* be configured to allow arbitrary-URI access (a documentation/hardening point), but that is not the TDB-backed default. +6. **ARQ JavaScript / custom functions** *(maintainer — rvesse)*: **opt-in**, with an **explicit allow-list of permitted JS functions** (`eval()` etc. blacklisted regardless of the allow-list). Custom **Java** functions require the operator to add code to the class path — operator responsibility to trust it. Reachable code execution is **`by-design-operator-enabled`**, not a Jena vulnerability. +7. **RIOT / RDF-XML XXE** *(maintainer — afs's area; parsers rewritten recently)*: external-entity processing is **off by default** and afs (the authority here) believes the parsers are safe against untrusted RDF. **JSON-LD context resolution, however, fetches remote files by default** — a behavior of the JSON-LD dependency (the W3C JSON-LD WG is working on documenting/mitigating it); safer than XXE but a genuine remote-read/SSRF surface (§4/§6). + +**Wave 3 — resources, API, meta** +8. **Resource/DoS** *(maintainer — rvesse)*: **operator-tuned** via query timeout (server or per-dataset) + reverse-proxy request-size limits. Super-linear cost is **not a bug** — a tiny query can compute a massive cross-product (`SELECT * WHERE { ?a ?b ?c . ?d ?e ?f . ?g ?h ?i }`), and as a spec-compliant SPARQL engine Jena is no different from any other here. (The earlier "super-linear" framing is dropped.) +9. **In-process Java API** *(maintainer — rvesse)*: trusted-caller — an embedding app that concatenates untrusted input into SPARQL owns that injection; **parameterised queries are the recommended pattern**. +10. 
**§11a recurring non-findings** *(maintainer — rvesse, from the TDB FAQ)*: (a) **"Fuseki/TDB has a memory leak"** — unbounded memory growth under continuous read/write load is a known issue; the WAL ensures no data is lost on crash/restart. (b) **"Database is much larger on disk than the input"** — sparse files (disk-usage metrics vary by tool/filesystem) plus TDB2's MVCC trees orphaning old blocks on each write; expected, and a **compaction** operation reclaims the space (recommended periodically). +11. **Meta** *(maintainer)*: the in-repo `THREAT_MODEL.md` is canonical and references the website Fuseki security docs; the PMC owns revision. From d9a497f766ce88d0992836aed09fd8c76515be89 Mon Sep 17 00:00:00 2001 From: Rob Vesse Date: Tue, 16 Jun 2026 11:08:22 +0100 Subject: [PATCH 4/6] Apply suggestions from code review Applies various clarifications from @afs Co-authored-by: Andy Seaborne --- THREAT_MODEL.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md index 5c268ea3cb5..b8eb0d8bd58 100644 --- a/THREAT_MODEL.md +++ b/THREAT_MODEL.md @@ -14,17 +14,17 @@ See the License for the specific language governing permissions and limitations under the License. --> -# Apache Jena — Threat Model (v0 draft) +# Apache Jena — Threat Model (v1) ## §1 Header -- **Project:** Apache Jena (`apache/jena`), `main`, against which this draft was written. A monorepo: the RDF/SPARQL Java framework (`jena-core`, `jena-arq`, `jena-base`, RIOT parsers, `jena-tdb1`/`jena-tdb2` stores, SHACL/ShEx, GeoSPARQL, text index) **and** the Fuseki HTTP server (`jena-fuseki2`). +- **Project:** Apache Jena (`apache/jena`), `main`, against which this draft was written. A monorepo: the RDF/SPARQL Java framework (`jena-core`, `jena-arq`, `jena-base`, RIOT parsers, `jena-tdb1`/`jena-tdb2` stores, SHACL (`jena-shacl`), ShEx (`jena-shex`), GeoSPARQL `jena-geosparql`, text index `jena-text` **and** the Fuseki HTTP server (`jena-fuseki2`). 
- **Date:** 2026-06-02 (v0); 2026-06 (v1, PMC-reviewed). **Status:** v1 — reviewed by the Jena PMC (Rob Vesse + Andy Seaborne) on apache/jena#3966; their inline suggestions are folded in and their answers to the §14 questions promote the load-bearing claims to *(maintainer)* (see §14). **Author:** ASF Security team (drafted via the Scovetta rubric); now PMC-reviewed. - **Version binding:** versioned with the project; a report against version *N* is triaged against the model as it stood at *N*. - **Reporting cross-reference:** §8-property violations → report privately per ASF process (`security@apache.org` → `private@jena.apache.org`); §3/§9 findings are closed citing this document. - **Provenance legend:** *(documented)* = Jena's own docs/repo; *(maintainer)* = confirmed by a Jena PMC member through this process (andy@ has ratified destination + the help-with-model request); *(inferred)* = reasoned from architecture, not yet confirmed — each has a matching §14 open question. - **Confidence:** v1, PMC-reviewed — rvesse + afs answered every §14 question and folded their inline edits in; the high-value query-surface claims (SERVICE/SSRF, file/URI, JS functions, XXE, JSON-LD remote-context, the resource line) are now *(maintainer)*. -- **What Jena is:** Apache Jena is a Java framework for building Semantic-Web / linked-data applications over RDF. It provides an in-process API to RDF data held in memory or in a native store (TDB), the ARQ SPARQL query/update engine, RIOT parsers/serialisers for RDF syntaxes (Turtle, RDF/XML, JSON-LD, N-Triples, …), and **Fuseki** — a standalone HTTP server exposing SPARQL query, SPARQL Update, and the Graph Store Protocol over the network. 
*(documented — README, jena.apache.org; maintainer — andy@ 2026-06-01: "an HTTP-based data server (Fuseki) and a Java API to RDF data stored in memory and in a custom database")* +- **What Jena is:** Apache Jena is a Java framework for building Semantic-Web / linked-data applications using RDF and SPARQL. It provides an in-process API to RDF data held in memory or in a native store (TDB), the ARQ SPARQL query/update engine, RIOT parsers/serialisers for RDF syntaxes (Turtle, RDF/XML, JSON-LD, N-Triples, …), and **Fuseki** — a standalone HTTP server exposing SPARQL query, SPARQL Update, and the Graph Store Protocol over the network. *(documented — README, jena.apache.org; maintainer — andy@ 2026-06-01: "an HTTP-based data server (Fuseki) and a Java API to RDF data stored in memory and in a custom database")* ## §2 Scope and intended use @@ -46,7 +46,8 @@ limitations under the License. | RDF I/O (RIOT) | `jena-arq`/`jena-core` parsers (RDF/XML, Turtle, JSON-LD, …) | parses untrusted RDF | **In — XXE / parser-DoS surface** *(inferred)* | | Stores + text index | `jena-tdb1`, `jena-tdb2`; **`jena-text` (Lucene)** | filesystem | **In.** On-disk store is operator-trusted and private to the owning process *(maintainer)*; the Lucene text index is reachable from SPARQL via `text:query` — an in-model query surface *(maintainer — afs flagged jena-text)* | | IRI / langtag | `jena-iri3986`, `jena-langtag`, `jena-base` | none | **In (input parsing)** *(inferred)* | -| Validation / extensions | `jena-shacl`, `jena-shex`, `jena-geosparql`, `jena-serviceenhancer` | SERVICE | **In (reachable from queries)** *(inferred)* | +| Extensions | `jena-geosparql`, `jena-serviceenhancer` | SERVICE | **In (reachable from queries)** *(inferred)* | +| Validations | `jena-shacl`, `jena-shex` | HTTP GET requests | | Client/API helpers | `jena-rdfconnection`, `jena-querybuilder`, `jena-rdfpatch`, `jena-commonsrdf`, `jena-ontapi` | none | **In as libraries (memory/correctness)** *(inferred)* | | CLI 
tools | `jena-cmds` | filesystem | **In iff fed untrusted input; usually operator-run** *(inferred)* | | Examples / tests / benchmarks | `jena-examples`, `jena-integration-tests`, `jena-benchmarks` | n/a | **Out** *(see §3)* | @@ -107,7 +108,7 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* | Fuseki admin `/$/*` | dataset mgmt, backups | **must not be on the public net** | localhost-only (default) / operator network | | Java API (`QueryExecution`, `Model.read`) | query / RDF from the app | no — the embedding app's trust | app validates its own untrusted inputs | -- **Size/shape/rate:** query-cost / result-size / parser-nesting bounds — to confirm (Andy's volume concern); §8 resource line. *(inferred)* +- **Size/shape/rate:** query-cost / result-size / parser-nesting bounds; §8 resource line. *(inferred)* ## §7 Adversary model @@ -165,7 +166,7 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):* - "SPARQL query consumes lots of CPU/memory" — operator-tuned via query timeout + reverse-proxy size limits; a tiny query computing a huge cross-product is inherent to spec-compliant SPARQL (affects every engine), not a Jena bug *(maintainer — rvesse)*. - "ARQ can call JavaScript / custom functions" — only if the operator enabled them, with a JS allow-list / operator-supplied Java code; `by-design-operator-enabled`. `OUT-OF-MODEL: trusted-input` / `non-default-build` unless reachable anonymously *(maintainer — rvesse)*. - "Embedding app built an injectable SPARQL string" — the app's bug, not Jena's; parameterised queries are the recommended pattern. `OUT-OF-MODEL: trusted-input` *(maintainer — rvesse)*. -- "Fuseki/TDB has a memory leak" — unbounded memory growth under continuous read/write load is a known issue; the WAL guarantees no data loss on crash/restart. A recurring mailing-list topic, not a vulnerability *(maintainer — rvesse; TDB FAQ)*. 
+- "Fuseki/TDB1 has a memory leak" — unbounded memory growth under continuous read/write load is a known issue; the WAL guarantees no data loss on crash/restart. A recurring mailing-list topic, not a vulnerability *(maintainer — rvesse; TDB FAQ)*.
 - "TDB database is far larger on disk than the input" — sparse files (metrics vary by tool/filesystem) plus TDB2's MVCC trees orphaning old blocks per write; expected. The `compaction` operation reclaims space (run periodically) *(maintainer — rvesse; TDB FAQ)*.
 
 ## §12 Conditions that would change this model

From cc84d1e285dc897051b8f98909a6c374de7aa753 Mon Sep 17 00:00:00 2001
From: Rob Vesse
Date: Wed, 17 Jun 2026 14:27:27 +0100
Subject: [PATCH 5/6] Update THREAT_MODEL.md

Co-authored-by: Andy Seaborne
---
 THREAT_MODEL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md
index b8eb0d8bd58..2635349512b 100644
--- a/THREAT_MODEL.md
+++ b/THREAT_MODEL.md
@@ -18,7 +18,7 @@ limitations under the License.
 
 ## §1 Header
 
-- **Project:** Apache Jena (`apache/jena`), `main`, against which this draft was written. A monorepo: the RDF/SPARQL Java framework (`jena-core`, `jena-arq`, `jena-base`, RIOT parsers, `jena-tdb1`/`jena-tdb2` stores, SHACL (`jena-shacl`), ShEx (`jena-shex`), GeoSPARQL `jena-geosparql`, text index `jena-text` **and** the Fuseki HTTP server (`jena-fuseki2`).
+- **Project:** Apache Jena (`apache/jena`), `main`, against which this threat model was written. A monorepo: the RDF/SPARQL Java framework (`jena-core`, `jena-arq`, `jena-base`, RIOT parsers, `jena-tdb1`/`jena-tdb2` stores, SHACL (`jena-shacl`), ShEx (`jena-shex`), GeoSPARQL (`jena-geosparql`), text index (`jena-text`)) **and** the Fuseki HTTP server (`jena-fuseki2`).
 - **Date:** 2026-06-02 (v0); 2026-06 (v1, PMC-reviewed). **Status:** v1 — reviewed by the Jena PMC (Rob Vesse + Andy Seaborne) on apache/jena#3966; their inline suggestions are folded in and their answers to the §14 questions promote the load-bearing claims to *(maintainer)* (see §14). **Author:** ASF Security team (drafted via the Scovetta rubric); now PMC-reviewed.
 - **Version binding:** versioned with the project; a report against version *N* is triaged against the model as it stood at *N*.
 - **Reporting cross-reference:** §8-property violations → report privately per ASF process (`security@apache.org` → `private@jena.apache.org`); §3/§9 findings are closed citing this document.

From 151ede67fc6755a74789c0b15ae404558dda83a4 Mon Sep 17 00:00:00 2001
From: Jarek Potiuk
Date: Thu, 18 Jun 2026 17:59:06 -0400
Subject: [PATCH 6/6] THREAT_MODEL.md: apply afs's final review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- complete the Validations row (SHACL/ShEx imports = HTTP-fetch SSRF surface)
- drop the "(Draft one-liners)" placeholder in §11
- §11: drop standalone `file:` from the SERVICE misuse line (file: in a query is a URI name, not a dereferenced local file, per afs)

Generated-by: Claude Opus 4.8 (1M context)
---
 THREAT_MODEL.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md
index 2635349512b..150f42c601a 100644
--- a/THREAT_MODEL.md
+++ b/THREAT_MODEL.md
@@ -47,7 +47,7 @@ limitations under the License.
 | Stores + text index | `jena-tdb1`, `jena-tdb2`; **`jena-text` (Lucene)** | filesystem | **In.** On-disk store is operator-trusted and private to the owning process *(maintainer)*; the Lucene text index is reachable from SPARQL via `text:query` — an in-model query surface *(maintainer — afs flagged jena-text)* |
 | IRI / langtag | `jena-iri3986`, `jena-langtag`, `jena-base` | none | **In (input parsing)** *(inferred)* |
 | Extensions | `jena-geosparql`, `jena-serviceenhancer` | SERVICE | **In (reachable from queries)** *(inferred)* |
-| Validations | `jena-shacl`, `jena-shex` | HTTP GET requests |
+| Validations | `jena-shacl`, `jena-shex` | HTTP GET requests (imports) | **In (import-fetch = SSRF surface)** *(maintainer — afs)* |
 | Client/API helpers | `jena-rdfconnection`, `jena-querybuilder`, `jena-rdfpatch`, `jena-commonsrdf`, `jena-ontapi` | none | **In as libraries (memory/correctness)** *(inferred)* |
 | CLI tools | `jena-cmds` | filesystem | **In iff fed untrusted input; usually operator-run** *(inferred)* |
 | Examples / tests / benchmarks | `jena-examples`, `jena-integration-tests`, `jena-benchmarks` | n/a | **Out** *(see §3)* |
@@ -148,10 +148,8 @@ Per-surface trust table *(Fuseki defaults documented; the rest inferred):*
 
 ## §11 Known misuse patterns
 
-*(Draft one-liners — expand before publishing.)*
-
 - Exposing a public, update-enabled SPARQL endpoint with no auth. *(inferred)*
-- Leaving `SERVICE`/`file:` reachable from anonymous queries (SSRF / file read). *(inferred)*
+- Leaving `SERVICE` reachable from anonymous queries (SSRF / file read). *(inferred)*
 - Enabling ARQ JS functions on a public endpoint. *(inferred)*
 - Shipping the example `admin`/`pw` / no-TLS Fuseki setup to production. *(documented as not-for-prod)*
 - Building SPARQL by concatenating untrusted strings in an embedding app. *(inferred)*