## Run a vLLM server locally
First, install vLLM following the installation instructions provided by the vLLM project. You can then launch the server from the command line.
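For example, a minimal sketch of the launch command, assuming the 4B checkpoint and the `deepseek_r1` parser described in the compatibility grid below:

```bash
# Serve Holo2 4B behind vLLM's OpenAI-compatible API (port 8000 by default).
# --reasoning-parser deepseek_r1 enables thinking mode; remove it to
# disable thinking (see the compatibility grid below).
vllm serve HCompany/Holo2-4B \
  --reasoning-parser deepseek_r1
```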
Good to know: to disable thinking mode, remove the `--reasoning-parser` argument.

## Deploy via Docker
First, make sure you’ve met the following prerequisites:

- An NVIDIA GPU with drivers installed
- NVIDIA Container Toolkit to allow Docker to access your GPU
- Docker installed and running
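With the prerequisites in place, a typical launch looks like the sketch below; the image tag and Hugging Face cache mount are illustrative assumptions:

```bash
# Run vLLM's OpenAI-compatible server in Docker with GPU access.
# --gpus=all exposes all NVIDIA GPUs to the container; mounting the
# Hugging Face cache avoids re-downloading weights on every restart.
docker run --gpus=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model HCompany/Holo2-4B \
  --reasoning-parser deepseek_r1
```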
Good to know:

- To disable thinking mode, remove the `--reasoning-parser` argument.
- To run Holo2 8B, change `--model` to `HCompany/Holo2-8B`.
- To run Holo2 30B A3B, change `--model` to `HCompany/Holo2-30B-A3B` and add `--tensor-parallel-size 2`.
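For instance, a sketch of the 30B A3B variant following those notes (assumes two GPUs are available):

```bash
# Holo2 30B A3B, sharded across two GPUs with tensor parallelism.
docker run --gpus=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model HCompany/Holo2-30B-A3B \
  --reasoning-parser deepseek_r1 \
  --tensor-parallel-size 2
```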
## Holo2 reasoning parser compatibility
Holo2 models are reasoning models. To extract the reasoning content of a request, you need to set `--reasoning-parser` accordingly in vLLM (Docker or `vllm serve`).

The Holo2 chat template can enable or disable thinking. By default, Holo2 runs in thinking mode, so if `--reasoning-parser` is not provided, you must disable thinking at the request level. Here is the compatibility grid for the `--reasoning-parser` argument:
| Parser | Features | Request-level argument |
|---|---|---|
| `deepseek_r1` | Thinking mode enabled; structured output supported | |
| None | Thinking mode disabled; structured output supported | `{"chat_template_kwargs": {"thinking": false}}` |
## Invoking Holo2 via API
When vLLM is running, you can send requests to the OpenAI-compatible endpoint at `http://localhost:8000/v1`.

### Test with curl - with deepseek_r1 reasoning parser (thinking mode)
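A sketch of a request against a server started with `--reasoning-parser deepseek_r1` (the prompt is illustrative; with the parser set, vLLM returns the thinking in a separate `reasoning_content` field):

```bash
# Chat completion in thinking mode: the deepseek_r1 parser splits the
# response into reasoning_content (thinking) and content (final answer).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HCompany/Holo2-4B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```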
### Test with curl - without reasoning parser (no thinking mode)
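The same request against a server started without `--reasoning-parser`; per the compatibility grid, thinking must then be disabled at the request level via `chat_template_kwargs`:

```bash
# Without a reasoning parser, disable thinking per request through
# chat_template_kwargs, as shown in the compatibility grid above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HCompany/Holo2-4B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "chat_template_kwargs": {"thinking": false}
  }'
```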
### Test with Python (OpenAI SDK)
First, install the OpenAI client:
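```bash
pip install openai
```

Then a minimal sketch of a request through the SDK (the prompt is illustrative; `extra_body` passes vLLM-specific fields such as `chat_template_kwargs`):

```python
# Query the local vLLM server through the OpenAI SDK. The API key is
# not checked by vLLM, but the SDK requires one, so "EMPTY" is used.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HCompany/Holo2-4B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    # Uncomment to disable thinking when no reasoning parser is set:
    # extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(response.choices[0].message.content)
```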
Good to know:

- The API key is not used by vLLM, but it is required by the OpenAI SDK; use “EMPTY” as a placeholder.
- `--model` can be set to `HCompany/Holo2-4B`, `HCompany/Holo2-8B`, or `HCompany/Holo2-30B-A3B`.
- `--gpus=all` enables all NVIDIA GPUs for the container.
- Holo2 is a multimodal model, so you can adjust image limits using `--limit-mm-per-prompt`.
- Reduce `--max-model-len` or `--gpu-memory-utilization` if your GPU runs out of memory.
- Ensure your GPU supports bfloat16 (e.g., H100, A100, L40S, RTX 4090); use float16 otherwise.
- Port 8000 must be free; change it with `-p <host>:8000` if needed.