Make ExecuTorch usable as a reliable local AI server #20001

@mergennachin

Description

Problem statement:

ExecuTorch can run LLMs locally, but today it is not structured like a serving system. Local AI use cases such as tool calling, chat assistants, and Pi-style integrations need more than a
single blocking generation call. They need reliable request handling, clear API behavior, streaming, tool-call support, and a way to handle multiple conversations without duplicating model
weights or corrupting model state.

The current path makes it hard to safely support concurrent or interleaved requests, especially when KV cache and generation state are tied too closely to one runner invocation. This
limits ExecuTorch’s ability to compete with local serving stacks like llama.cpp server and Ollama for practical on-device assistant workloads.
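
To make the state-isolation concern concrete, here is a minimal sketch, assuming a design where model weights are loaded once and each conversation owns its own KV cache. It is not ExecuTorch's API: `SharedModel`, `Session`, and `SessionManager` are hypothetical names, the "KV cache" is a plain list standing in for real tensors, and the single lock is a deliberately simple correctness-first choice rather than a performance recommendation.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SharedModel:
    """Hypothetical stand-in for weights that are loaded once and shared read-only."""
    name: str

    def next_token(self, context: tuple) -> str:
        # Toy deterministic decode step: depends only on the caller's own context.
        return f"tok{len(context)}"


@dataclass
class Session:
    """Per-conversation state; kv_cache is a placeholder for real KV tensors."""
    kv_cache: list = field(default_factory=list)


class SessionManager:
    """Routes interleaved requests to per-session state over one shared model."""

    def __init__(self, model: SharedModel) -> None:
        self._model = model
        self._sessions: dict = {}
        # Correctness first: only one decode step touches the model at a time.
        self._lock = threading.Lock()

    def step(self, session_id: str, prompt_token: Optional[str] = None) -> str:
        with self._lock:
            session = self._sessions.setdefault(session_id, Session())
            if prompt_token is not None:
                session.kv_cache.append(prompt_token)
            token = self._model.next_token(tuple(session.kv_cache))
            session.kv_cache.append(token)
            return token

    def history(self, session_id: str) -> list:
        with self._lock:
            return list(self._sessions[session_id].kv_cache)


# Two conversations interleave without sharing or corrupting each other's state.
manager = SessionManager(SharedModel("toy"))
manager.step("a", "hello")
manager.step("b", "hi")
manager.step("a")
```

In this sketch the generation state lives in the session rather than in the runner invocation, so interleaving requests for `"a"` and `"b"` leaves each history intact while both reuse the same `SharedModel` instance.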

This issue tracks making ExecuTorch’s LLM runtime usable as a correctness-first local serving backend: predictable behavior, OpenAI-compatible integration, robust tool use, and a
foundation that can grow toward better performance over time.

Why this matters:

Local AI users expect to point clients at a server and get dependable chat/tool behavior. If ExecuTorch wants to be viable in that role, the serving path needs to be explicit, tested, and safe under real request patterns rather than only optimized for one-shot generation demos.

Labels: triaged
