How to Run Distributed Llama on 🧠 GPU

Distributed Llama can run on GPU devices via the Vulkan API. This article describes how to build and run the project on a GPU.

Before you start here, please check how to build and run Distributed Llama on CPU.

To run on GPU, please follow these steps:

  1. Install Vulkan SDK for your platform.
  1. Build Distributed Llama with GPU support:
DLLAMA_VULKAN=1 make dllama
DLLAMA_VULKAN=1 make dllama-api
  1. Now the dllama and dllama-api binaries support an argument related to GPU usage.
--gpu-index <index>   Use GPU device with given index (use `0` for first device)
  1. You can run the root node or a worker node on GPU by specifying the --gpu-index argument. The Vulkan backend requires a single thread, so you should also set --nthreads 1.
./dllama inference ... --nthreads 1 --gpu-index 0 
./dllama chat      ... --nthreads 1 --gpu-index 0 
./dllama worker    ... --nthreads 1 --gpu-index 0 
./dllama-api       ... --nthreads 1 --gpu-index 0
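
Putting the steps above together, a full build-and-run session might look like the sketch below. The model and tokenizer paths are placeholders for illustration only; substitute the files you prepared when following the CPU guide, and note that the exact `--model`/`--tokenizer`/`--prompt` arguments are assumed to match that guide.

```sh
# Assumes the Vulkan SDK is already installed for your platform.

# Build both binaries with Vulkan support enabled.
DLLAMA_VULKAN=1 make dllama
DLLAMA_VULKAN=1 make dllama-api

# Run inference on the first GPU device (index 0).
# The Vulkan backend needs a single thread, hence --nthreads 1.
# Paths below are hypothetical placeholders.
./dllama inference \
  --model models/my_model/dllama_model.m \
  --tokenizer models/my_model/dllama_tokenizer.t \
  --prompt "Hello" \
  --nthreads 1 --gpu-index 0
```

If your machine has several GPUs, changing `--gpu-index` selects which device the node uses; each node (root or worker) picks its own device independently.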