Deploying LLMs with vLLM
vLLM is a high-performance system designed to accelerate the serving of large language models (LLMs), making them more efficient and scalable for real-world applications. Developed by researchers at UC Berkeley, vLLM aims to overcome the limitations that existing inference systems face, particularly when serving modern LLMs like GPT-3 and GPT-4. The core innovation in vLLM is its memory management system, PagedAttention, which is tailored to optimize GPU memory use during inference.
- Create a Deployment.
- For the command, use something like:
  vllm serve lmsys/vicuna-7b-v1.5 --dtype half --quantization bitsandbytes --load-format bitsandbytes
- Choose a GPU instance type.
- Choose the saturn-python-llm image, version 2024.08.01.
- Click Save.
Click “start” to deploy your LLM. Please see the section on deployments to learn how to authenticate with this deployment, as well as how to restrict access to it.
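Once the deployment is running, the vLLM server exposes an OpenAI-compatible HTTP API. The following is a minimal sketch of a completion request using curl; the deployment URL and bearer token are placeholders, and the exact authentication mechanism is described in the deployments section:

# Replace <your-deployment-url> and <your-token> with your deployment's actual values.
curl https://<your-deployment-url>/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-token>" \
  -d '{
        "model": "lmsys/vicuna-7b-v1.5",
        "prompt": "What is vLLM?",
        "max_tokens": 64
      }'

The same server also exposes /v1/chat/completions for chat-style requests and /v1/models for listing the loaded model.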
vLLM serve options
As long as your model architecture is supported, you should be able to serve it with vLLM. The parameters --dtype half --quantization bitsandbytes --load-format bitsandbytes are recommended to reduce the GPU memory footprint of the larger models.
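If a quantized model still does not fit in GPU memory, vLLM provides additional flags for bounding memory use. The sketch below is a variation of the command above, assuming a recent vLLM release where --max-model-len and --gpu-memory-utilization are available; run vllm serve --help to confirm the options in your installed version:

# Assumed variation: cap the maximum sequence length and the fraction of GPU
# memory vLLM is allowed to reserve for weights and KV cache.
vllm serve lmsys/vicuna-7b-v1.5 \
  --dtype half \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90

Lowering --max-model-len reduces the KV-cache space a single request can demand, while --gpu-memory-utilization controls how much of the GPU vLLM reserves overall, so tuning both can help a tight configuration start successfully.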