Model swapping for llama.cpp (or any local OpenAI compatible server)
llama-swap is a lightweight, transparent proxy server that provides automatic model swapping for llama.cpp's server.
When a request is made to an OpenAI compatible endpoint, llama-swap extracts the `model` value and loads the appropriate server configuration to serve it. If the wrong upstream server is running, it is replaced with the correct one. This is where the "swap" part comes in: the upstream server is automatically swapped to the one needed to serve the request.
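For example, a request to `/v1/chat/completions` whose body contains `"model": "llama-8b"` is matched against the model names defined in the configuration file. Below is a minimal sketch of such a configuration, assuming the `models`, `cmd`, and `proxy` keys and the `${PORT}` macro from the project's example configuration; the model names, file paths, and flags are hypothetical:

```yaml
# config.yaml (sketch) -- model names, paths, and flags are hypothetical
models:
  "llama-8b":
    # command llama-swap runs to start the upstream server for this model;
    # ${PORT} is substituted with a port that llama-swap assigns
    cmd: llama-server --port ${PORT} -m /models/llama-3.1-8b-instruct.Q4_K_M.gguf
    # where llama-swap forwards requests once the upstream server is ready
    proxy: http://127.0.0.1:${PORT}
  "qwen-coder":
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-coder-7b.Q4_K_M.gguf
    proxy: http://127.0.0.1:${PORT}
```

With this configuration, a request with `"model": "qwen-coder"` would cause llama-swap to stop the `llama-8b` server (if it is running) and start the `qwen-coder` one before proxying the request.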
In the most basic configuration, llama-swap handles one model at a time. For more advanced use cases, the `groups`
feature allows multiple models to be loaded at the same time, as sketched below. You have complete control over how your system resources are used.
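A hedged sketch of a groups configuration follows, assuming the `groups`, `swap`, `exclusive`, and `members` keys; the group and model names are hypothetical, and the exact schema and key semantics should be checked against the project's example configuration:

```yaml
# groups sketch -- keys and their semantics assumed from the example config
groups:
  "chat":
    swap: true        # assumed: only one member of this group runs at a time
    exclusive: true   # assumed: loading this group unloads other groups' models
    members:
      - "llama-8b"
      - "qwen-coder"
  # a non-exclusive group can stay resident alongside the others,
  # e.g. a small embeddings model that should always be available
  "embeddings":
    swap: false
    exclusive: false
    members:
      - "embed-small"
```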