Model swapping for llama.cpp (or any local OpenAI compatible server)
llama-swap is a lightweight, transparent proxy server that provides automatic model swapping for llama.cpp's server.
When a request is made to an OpenAI compatible endpoint, llama-swap extracts the `model` value and loads the appropriate server configuration to serve it. If the wrong upstream server is running, it is replaced with the correct one. This is where the "swap" part comes in: the upstream server is automatically swapped to the one needed to serve the request.
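For example, a request to `/v1/chat/completions` whose body contains `"model": "llama-8b"` is matched against the model names defined in the configuration file. Below is a minimal sketch of such a configuration, assuming the `models`, `cmd`, and `proxy` keys and the `${PORT}` macro from the project's example configuration; the model names, file paths, and flags are hypothetical:

```yaml
# config.yaml (sketch) -- model names, paths, and flags are hypothetical
models:
  "llama-8b":
    # command llama-swap runs to start the upstream server for this model;
    # ${PORT} is substituted with a port that llama-swap assigns
    cmd: llama-server --port ${PORT} -m /models/llama-3.1-8b-instruct.Q4_K_M.gguf
    # where llama-swap forwards requests once the upstream server is ready
    proxy: http://127.0.0.1:${PORT}
  "qwen-coder":
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-coder-7b.Q4_K_M.gguf
    proxy: http://127.0.0.1:${PORT}
```

With this configuration, a request with `"model": "qwen-coder"` would cause llama-swap to stop the `llama-8b` server (if it is running) and start the `qwen-coder` one before proxying the request.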
In the most basic configuration, llama-swap handles one model at a time. For more advanced use cases, the `groups`
feature allows multiple models to be loaded at the same time, as sketched below. You have complete control over how your system resources are used.
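A hedged sketch of a groups configuration follows, assuming the `groups`, `swap`, `exclusive`, and `members` keys; the group and model names are hypothetical, and the exact schema and key semantics should be checked against the project's example configuration:

```yaml
# groups sketch -- keys and their semantics assumed from the example config
groups:
  "chat":
    swap: true        # assumed: only one member of this group runs at a time
    exclusive: true   # assumed: loading this group unloads other groups' models
    members:
      - "llama-8b"
      - "qwen-coder"
  # a non-exclusive group can stay resident alongside the others,
  # e.g. a small embeddings model that should always be available
  "embeddings":
    swap: false
    exclusive: false
    members:
      - "embed-small"
```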