BoxedLLaMA

Put llama.cpp on rails.

What is BoxedLLaMA?

BoxedLLaMA is a Delphi toolkit that wraps llama.cpp into a managed, batteries-included package for local AI inference on Windows.

Most developers who want local AI face the same wall: download the right binary, figure out the command-line flags, spawn a process, parse HTTP responses, manage the lifecycle, and hope nothing breaks when llama.cpp ships a new release next Tuesday. BoxedLLaMA handles all of it. One toolkit. One managed subprocess. Zero DLL coupling.

Server.AddMessage('user', 'Summarize the Delphi roadmap.');
LResult := Server.ChatCompletionStream(CModelName, LChatConfig);

Two calls are enough to stream a response from a local model. Everything underneath -- the binary, the process, the HTTP plumbing, the response parsing -- is handled.
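
To give a sense of the response parsing that gets hidden, here is a minimal sketch of reading one streamed chunk. It assumes the OpenAI-style server-sent-event format that llama-server emits for streaming chat completions (`data: {json}` lines ending with `data: [DONE]`). The `TryExtractDelta` function and the sample lines are hypothetical illustrations, not part of the BoxedLLaMA API, and the toolkit's own parser may work differently.

```delphi
program SseDeltaSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.JSON;

// Hypothetical helper: extracts the text fragment from one server-sent-event
// line of an OpenAI-style streaming chat completion. Returns False for blank
// lines, non-data lines, the '[DONE]' sentinel, and chunks without content.
function TryExtractDelta(const ALine: string; out AText: string): Boolean;
const
  CPrefix = 'data: ';
var
  LPayload: string;
  LJson: TJSONValue;
begin
  AText := '';
  Result := False;
  if not ALine.StartsWith(CPrefix) then
    Exit;
  LPayload := ALine.Substring(Length(CPrefix)).Trim;
  if LPayload = '[DONE]' then
    Exit;
  LJson := TJSONObject.ParseJSONValue(LPayload);
  if LJson = nil then
    Exit; // malformed JSON is skipped rather than raised
  try
    Result := LJson.TryGetValue<string>('choices[0].delta.content', AText);
  finally
    LJson.Free;
  end;
end;

const
  // Hand-written sample lines standing in for a real HTTP stream.
  CSampleStream: array[0..3] of string = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    '',
    'data: [DONE]');

var
  LLine, LToken: string;
begin
  // Print each token as it "arrives", the way a token callback would.
  for LLine in CSampleStream do
    if TryExtractDelta(LLine, LToken) then
      Write(LToken);
  Writeln;
end.
```

With BoxedLLaMA none of this is written by hand; the sketch only shows the kind of work the streaming call takes over.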

Features

| Feature | What It Means |
|---|---|
| 🚀 Automatic server management | Downloads, installs, and auto-updates the llama.cpp server binary from GitHub releases. Version policies: auto, pinned, or manual. |
| 💬 Chat completions | Synchronous and streaming with full token callback support. Token counts, timing, and generation speed in one result record. |
| 🔧 Tool calling | Two-tier architecture: three meta-tools visible to the model, full catalog discovered at runtime. Agentic multi-round tool loop built in. |
| 📐 Embeddings | Single and batch generation with cosine similarity. TChat automatically enables embeddings when an embedding model is configured. |
| 🧠 Persistent memory | SQLite + FTS5 + HNSW vector index. Hybrid retrieval (keyword + semantic) with automatic recall injection per turn. |
| 📄 Document ingest | Paragraph-aware chunking with configurable overlap. Drop a file into memory and let retrieval find the right pieces. |
| 🧭 Session management | System prompt invariant, context-budget trimming, history compaction, and two-level persistence (JSON history + KV cache). |
| 📥 HuggingFace models | Download, delete, and track GGUF models directly through the server API with SSE progress tracking. |
| 🧪 Reasoning models | Configurable thinking tag display for chain-of-thought models. Show, hide, or replace with a placeholder. |
| GPU offloading | Automatic, full, or manual GPU layer control with Vulkan backend. Quantized KV cache (Q4_0, Q8_0) to fit larger contexts in VRAM. |
| 🔄 Auto-updating | vpAuto checks GitHub on a configurable interval and updates the server binary silently. Your app always runs on the latest llama.cpp. |
| 🔌 Built on StdApp | Console UI, HTTP, JSON, VFS, crypto, and more. One dependency tree, no external packages. |
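
The embeddings feature above mentions cosine similarity. As a reference for what that measure computes, here is a minimal standalone sketch. The `CosineSimilarity` function below is a hypothetical illustration written for this README, not the toolkit's own routine, and the short vectors stand in for real embedding output.

```delphi
program CosineSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: cosine of the angle between two vectors,
// dot(A, B) / (|A| * |B|). Returns 0 when either vector has zero length.
function CosineSimilarity(const A, B: array of Single): Double;
var
  I: Integer;
  LDot, LNormA, LNormB: Double;
begin
  if (Length(A) = 0) or (Length(A) <> Length(B)) then
    raise EArgumentException.Create('Vectors must be non-empty and equal in length');
  LDot := 0;
  LNormA := 0;
  LNormB := 0;
  for I := 0 to High(A) do
  begin
    LDot := LDot + A[I] * B[I];
    LNormA := LNormA + A[I] * A[I];
    LNormB := LNormB + B[I] * B[I];
  end;
  if (LNormA = 0) or (LNormB = 0) then
    Exit(0);
  Result := LDot / (Sqrt(LNormA) * Sqrt(LNormB));
end;

begin
  // Identical directions score highest; orthogonal vectors score zero.
  Writeln(Format('%.3f', [CosineSimilarity([1, 0, 1], [1, 0, 1])]));
  Writeln(Format('%.3f', [CosineSimilarity([1, 0], [0, 1])]));
end.
```

Scores close to 1 mean two texts point in nearly the same direction in embedding space, which is the signal the semantic half of hybrid retrieval relies on.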

Architecture

Your Application
    |
    v
+-----------------------------------------+
|  BoxedLLaMA Toolkit                     |
|                                         |
|  TChat -----> TSession -----> TServer   |
|    |              |               |     |
|    v              v               v     |
|  TConsoleChat  TMemory    llama-server  |
|  (frontend)   (SQLite+    (managed      |
|               FTS5+HNSW)  subprocess)   |
|                               |         |
|  TToolRegistry <----tool---+  |         |
|  TToolBuilder    calls        |         |
+-----------------------------------------+
    |
    v
Local GGUF Models (Vulkan GPU inference)
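
TMemory in the diagram is fed by document ingest, which the feature list describes as paragraph-aware chunking with configurable overlap. The sketch below shows one way such chunking can work. It is an assumption-laden illustration: the `ChunkParagraphs` function, the character-based size limit, and the choice to measure overlap in whole paragraphs are all hypothetical and may differ from what BoxedLLaMA does internally.

```delphi
program ChunkSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Generics.Collections;

// Hypothetical helper: packs whole paragraphs (separated by blank lines)
// into chunks of roughly AMaxChars characters. Each new chunk starts by
// repeating the last AOverlap paragraphs of the previous one. The limit is
// soft: a long paragraph, or carried-over overlap, can push a chunk past it.
function ChunkParagraphs(const AText: string;
  AMaxChars, AOverlap: Integer): TArray<string>;
var
  LParas: TArray<string>;
  LChunks, LCurrent: TList<string>;
  LPara, LTrimmed, LKept: string;
  LSize: Integer;
begin
  LParas := AText.Replace(#13#10, #10).Split([#10#10],
    TStringSplitOptions.ExcludeEmpty);
  LChunks := TList<string>.Create;
  LCurrent := TList<string>.Create;
  try
    LSize := 0;
    for LPara in LParas do
    begin
      LTrimmed := LPara.Trim;
      if LTrimmed = '' then
        Continue;
      // Flush the current chunk when the next paragraph would not fit.
      if (LCurrent.Count > 0) and (LSize + Length(LTrimmed) > AMaxChars) then
      begin
        LChunks.Add(string.Join(#10#10, LCurrent.ToArray));
        // Keep only the trailing paragraphs as overlap for the next chunk.
        while (LCurrent.Count > 0) and (LCurrent.Count > AOverlap) do
          LCurrent.Delete(0);
        LSize := 0;
        for LKept in LCurrent do
          Inc(LSize, Length(LKept));
      end;
      LCurrent.Add(LTrimmed);
      Inc(LSize, Length(LTrimmed));
    end;
    if LCurrent.Count > 0 then
      LChunks.Add(string.Join(#10#10, LCurrent.ToArray));
    Result := LChunks.ToArray;
  finally
    LCurrent.Free;
    LChunks.Free;
  end;
end;

const
  CDoc = 'First paragraph.' + #10#10 + 'Second paragraph.' + #10#10 +
    'Third paragraph.';

var
  LChunk: string;
begin
  // A 40-character budget with one paragraph of overlap.
  for LChunk in ChunkParagraphs(CDoc, 40, 1) do
  begin
    Writeln('--- chunk ---');
    Writeln(LChunk);
  end;
end.
```

In this sample the second paragraph appears in both chunks. Splitting on paragraph boundaries keeps sentences intact, and the overlap gives retrieval a chance to match context that straddles a chunk boundary.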

📖 Full Documentation -- configuration, API reference, code examples, and architecture details for every module.

Getting Started

  1. Clone the repository:
     git clone https://github.com/tinyBigGAMES/BoxedLLaMA.git
  2. Open projects\Testbed\Testbed.dproj in Delphi 12 or higher
  3. Build the Testbed project (Win64 target)
  4. Run -- the server binary downloads automatically on first launch
  5. Place your GGUF model files in C:\Dev\LLM\GGUF (or update the path in projects\Testbed\UTestbed.Common.pas). Single-file models go in the root; multimodal models with a mmproj file get their own subfolder. Reference local models by filename without the .gguf extension.
  6. (Optional) Set the TAVILY_API_KEY environment variable for web search tools. Get a free key at Tavily (1,000 credits/month).
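
Before running the demos, it can help to confirm the setup from the steps above. The following standalone check is a sketch using only the Delphi RTL, not a BoxedLLaMA utility. It lists the single-file models in the root of the default model folder by the name you would reference them with (filename without .gguf) and reports whether TAVILY_API_KEY is set. The program name is hypothetical; change `CModelDir` if you changed the path in UTestbed.Common.pas.

```delphi
program ModelPreflight;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.IOUtils;

const
  // Default model folder from the setup steps; adjust if you changed it.
  CModelDir = 'C:\Dev\LLM\GGUF';

var
  LFile: string;
begin
  if TDirectory.Exists(CModelDir) then
  begin
    // Top-level files only, so models kept in subfolders are not listed.
    for LFile in TDirectory.GetFiles(CModelDir, '*.gguf') do
      Writeln('Model name: ', TPath.GetFileNameWithoutExtension(LFile));
  end
  else
    Writeln('Model folder not found: ', CModelDir);

  // The key is optional and only matters for the web search tools.
  if GetEnvironmentVariable('TAVILY_API_KEY') = '' then
    Writeln('TAVILY_API_KEY is not set (web search tools will need it).')
  else
    Writeln('TAVILY_API_KEY is set.');
end.
```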

Recommended Models

These vetted models work out of the box with the testbed demos:

| Purpose | Model | Size | Download |
|---|---|---|---|
| 💬 Chat (multimodal) | Gemma 4 E4B Abliterated Q4_K | 5.3 GB | Download |
| 👁️ Vision projector | mmproj for Gemma 4 E4B (bf16) | 992 MB | Download |
| 📐 Embeddings | Qwen3 Embedding 0.6B Q8_0 | 639 MB | Download |

System Requirements

| Component | Requirement |
|---|---|
| 🖥️ Host OS | Windows 10/11 x64 |
| 🎮 GPU | Vulkan-capable GPU recommended |
| ⚙️ Building from source | Delphi 12.x or higher |
| 📦 Runtime dependencies | None -- server binary downloaded automatically |

Important

This repository is under active development. Follow the repo or join the Discord to track progress.

Contributing

BoxedLLaMA is an open project. Whether you are fixing a bug, improving documentation, or proposing a feature, contributions are welcome.

  • 🐛 Report bugs: Open an issue with a minimal reproduction
  • 💡 Suggest features: Describe the use case first
  • 🔧 Submit pull requests: Bug fixes, documentation improvements, and well-scoped features

Join the Discord to discuss development, ask questions, and share what you are building.

Support the Project

If BoxedLLaMA saves you time or sparks something useful:

  • Star the repo -- helps others find the project
  • 🗣️ Spread the word -- write a post, mention it in a community
  • 💬 Join us on Discord -- share what you are building
  • 💖 Become a sponsor -- sponsorship directly funds development
  • 🦋 Follow on Bluesky -- stay in the loop on releases

License

BoxedLLaMA is licensed under the Apache License 2.0. See LICENSE for details.

Apache 2.0 is a permissive open source license that lets you use, modify, and distribute BoxedLLaMA freely in both open source and commercial projects. You are not required to release your own source code. The license includes an explicit patent grant. Attribution is required -- keep the copyright notice and license file in place.

BoxedLLaMA™ -- Put llama.cpp on rails.

Copyright © 2026-present tinyBigGAMES™ LLC
All Rights Reserved.

About

BoxedLLaMA is a specialized software toolkit designed for developers to integrate local artificial intelligence into Windows applications. It functions as a comprehensive wrapper for llama.cpp, automating complex tasks such as server installation, version updates, and model management from Hugging Face.
