Problem statement:
ExecuTorch can run LLMs locally, but today it is not structured like a serving system. Local AI use cases such as tool calling, chat assistants, and Pi-style integrations need more than a
single blocking generation call. They need reliable request handling, clear API behavior, streaming, tool-call support, and a way to handle multiple conversations without duplicating model
weights or corrupting model state.
The current generation path makes it hard to safely support concurrent or interleaved requests, especially when the KV cache and generation state are tied too closely to a single runner invocation. This
limits ExecuTorch’s ability to compete with local serving stacks like llama.cpp server and Ollama for practical on-device assistant workloads.
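To make the state-isolation concern concrete, here is a minimal Python sketch of the separation this issue is asking for: one shared, read-only model object and a separate cache per conversation. It is an illustration only, not ExecuTorch's API. `SharedModel`, `Session`, and `SessionManager` are hypothetical names, and the "decode step" is a toy arithmetic function standing in for a real forward pass.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SharedModel:
    """Hypothetical stand-in for weights that are loaded once and never mutated."""
    name: str

    def next_token(self, context: List[int]) -> int:
        # Toy deterministic "decode step"; a real runtime would run the model here.
        return (sum(context) + len(context)) % 100


@dataclass
class Session:
    """Hypothetical per-conversation state; the list stands in for a KV cache."""
    kv_cache: List[int] = field(default_factory=list)

    def step(self, model: SharedModel, token: int) -> int:
        # Only this session's cache is read or written.
        self.kv_cache.append(token)
        out = model.next_token(self.kv_cache)
        self.kv_cache.append(out)
        return out


class SessionManager:
    """Routes each request to its own session while sharing a single model."""

    def __init__(self, model: SharedModel) -> None:
        self.model = model
        self.sessions: Dict[str, Session] = {}

    def step(self, session_id: str, token: int) -> int:
        session = self.sessions.setdefault(session_id, Session())
        return session.step(self.model, token)


def run_isolated(tokens: List[int]) -> List[int]:
    """Runs one conversation alone, as a reference for the interleaved case."""
    manager = SessionManager(SharedModel("toy"))
    return [manager.step("only", t) for t in tokens]
```

In this sketch, interleaving steps from two session IDs yields the same per-session outputs as running each conversation alone, which is the correctness property the serving path would need to guarantee. Real concerns such as memory limits, cache eviction, and thread safety are deliberately left out.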
This issue tracks making ExecuTorch’s LLM runtime usable as a correctness-first local serving backend: predictable behavior, OpenAI-compatible integration, robust tool use, and a
foundation that can grow toward better performance over time.
Why this matters:
Local AI users expect to point clients at a server and get dependable chat and tool-calling behavior. If ExecuTorch wants to be viable in that role, the serving path needs to be explicit, tested, and safe under real request patterns, not just optimized for one-shot generation demos.