This guide shows you how to deploy the Holo2 model with vLLM on NVIDIA GPUs.

Run a vLLM server locally

First, install vLLM by following the installation instructions provided by vLLM. You can then launch the vLLM server from the command line, for example:
vllm serve Hcompany/Holo2-4B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=deepseek_r1 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'
Good to know: to disable thinking mode, remove the --reasoning-parser argument.
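
Once the server is up, you can check that it is reachable and see which model it is serving by querying the OpenAI-compatible /v1/models endpoint. Below is a minimal sketch using only the Python standard library; it assumes the default address used by the command above (localhost:8000).

import json
import urllib.request

# Query vLLM's OpenAI-compatible model listing endpoint.
# Assumes the server started above is listening on localhost:8000.
with urllib.request.urlopen("http://localhost:8000/v1/models") as response:
    models = json.load(response)

# Prints the served model IDs, e.g. "Hcompany/Holo2-4B".
for model in models.get("data", []):
    print(model["id"])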

Deploy via Docker

First, make sure you’ve met the prerequisites for GPU containers: Docker is installed and the NVIDIA Container Toolkit is configured so that --gpus=all can expose your GPUs to the container. Next, run the Holo2 model, for example:
docker run -it --gpus=all --rm -p 8000:8000 vllm/vllm-openai:v0.11.0 \
    --model Hcompany/Holo2-4B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=deepseek_r1 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'
Good to know
  • To disable thinking mode, remove the --reasoning-parser argument.
  • To run Holo2 8B, change --model to Hcompany/Holo2-8B.
  • To run Holo2 30B A3B, change --model to Hcompany/Holo2-30B-A3B and add --tensor-parallel-size 2.
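
The first start can take a while because the container downloads the model weights before serving. If you script the deployment, a simple readiness check is to poll the /v1/models endpoint until it responds; the sketch below assumes the default port mapping (localhost:8000) from the command above.

import json
import time
import urllib.request

URL = "http://localhost:8000/v1/models"

# Poll until the container has downloaded and loaded the model.
for _ in range(120):  # up to ~20 minutes at a 10-second interval
    try:
        with urllib.request.urlopen(URL, timeout=5) as response:
            served = [m["id"] for m in json.load(response).get("data", [])]
        print("Server ready, serving:", served)
        break
    except OSError:
        time.sleep(10)
else:
    print("Server did not become ready in time.")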

Holo2 reasoning parser compatibility

Holo2 models are reasoning models. To extract the reasoning content of a request, set --reasoning-parser accordingly in vLLM (Docker or vllm serve). The Holo2 chat template can enable or disable thinking; by default, Holo2 runs in thinking mode, so if --reasoning-parser is not provided you must disable thinking at the request level. Here is the compatibility grid for the --reasoning-parser argument:
Parser         Features                                               Request-level argument
deepseek_r1    Thinking mode enabled; structured output supported     (none)
None           Thinking mode disabled; structured output supported    {"chat_template_kwargs": {"thinking": false}}
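
With the deepseek_r1 parser enabled, vLLM returns the model's reasoning separately from the final answer in a reasoning_content field on the message (a vLLM extension to the OpenAI schema). Below is a minimal sketch of reading both fields with the OpenAI SDK, assuming a server started as above on localhost:8000.

from openai import OpenAI

# Assumes the server was launched with --reasoning-parser=deepseek_r1 (thinking mode).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Hcompany/Holo2-4B",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
)

message = completion.choices[0].message
# reasoning_content is not part of the official OpenAI schema, so read it defensively.
print("Reasoning:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)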

Invoking Holo2 via API

When vLLM is running, you can send requests to:
http://localhost:8000/v1/chat/completions

Test with curl - with deepseek_r1 reasoning parser (thinking mode)

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Test with curl - without reasoning parser (no thinking mode)

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ],
        "chat_template_kwargs": {
            "thinking": false
        }
    }'

Test with Python (OpenAI SDK)

First, install the OpenAI client.
pip install openai
Next, run the Python script, for example:
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"
API_KEY = "EMPTY"
MODEL = "HCompany/Holo2-4B"

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

# With deepseek_r1 reasoning parser (thinking mode)
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
)

print(chat_completion.choices[0].message.content)

# Without reasoning parser (no thinking mode)
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

print(chat_completion.choices[0].message.content)
Good to know
  • The API key is not used by vLLM but is required by the OpenAI SDK; use "EMPTY" as a placeholder.
  • --model can be set to Hcompany/Holo2-4B, Hcompany/Holo2-8B, or Hcompany/Holo2-30B-A3B.
  • --gpus=all enables all NVIDIA GPUs for the container.
  • Holo2 is a multimodal model, so you can adjust image limits using --limit-mm-per-prompt; see the image request sketch after this list.
  • Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory.
  • Ensure your GPU supports bfloat16 (e.g., H100, A100, L40S, RTX 4090); use float16 otherwise.
  • Port 8000 must be free; change it with -p <host>:8000 if needed.
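
Because Holo2 is multimodal, requests can also include images as standard OpenAI-style image_url content parts, which vLLM accepts for vision-language models. Below is a minimal sketch, assuming the Holo2-4B server from above and a placeholder image URL that you would replace with your own.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder image; replace with your own URL (or a base64 data: URL).
IMAGE_URL = "https://example.com/screenshot.png"

chat_completion = client.chat.completions.create(
    model="Hcompany/Holo2-4B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                {"type": "text", "text": "Describe what is shown in this screenshot."},
            ],
        }
    ],
)

print(chat_completion.choices[0].message.content)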