---
title: HuggingFace TGI
description: This example shows how to deploy Llama 4 Scout to any cloud or on-premises environment using HuggingFace TGI and dstack.
---

# HuggingFace TGI

This example shows how to deploy Llama 4 Scout with dstack using HuggingFace TGI :material-arrow-top-right-thin:{ .external }{:target="_blank"}.

??? info "Prerequisites"
    Once `dstack` is installed, go ahead and clone the repo, and run `dstack init`.

    <div class="termy">

    ```shell
    $ git clone https://github.com/dstackai/dstack
    $ cd dstack
    $ dstack init
    ```

    </div>
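    If `dstack` isn't installed yet, a typical local setup installs it with `pip` and starts the server (a sketch; see the dstack installation docs for other options):

    <div class="termy">

    ```shell
    $ pip install "dstack[all]" -U
    $ dstack server
    ```

    </div>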

## Deployment

Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct :material-arrow-top-right-thin:{ .external }{:target="_blank"} using TGI.

```yaml
type: service
name: llama4-scout

image: ghcr.io/huggingface/text-generation-inference:latest

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - MAX_INPUT_LENGTH=8192
  - MAX_TOTAL_TOKENS=16384
  # max_batch_prefill_tokens must be >= max_input_tokens
  - MAX_BATCH_PREFILL_TOKENS=8192
commands:
  # Activate the virtual environment at /usr/src/.venv/,
  # as required by TGI's latest image
  - . /usr/src/.venv/bin/activate
  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher

port: 80
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

# Uncomment to cache downloaded models
#volumes:
#  - /data:/data

resources:
  gpu: H200:2
  disk: 500GB..
```
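`DSTACK_GPUS_NUM` is a runtime variable that `dstack` sets to the number of GPUs allocated to the run. With the `H200:2` resources above, the command that ends up running inside the container is equivalent to this sketch:

<div class="termy">

```shell
# Inside the container, with 2 GPUs provisioned:
$ . /usr/src/.venv/bin/activate
$ NUM_SHARD=2 text-generation-launcher
```

</div>

Since `NUM_SHARD` controls how many tensor-parallel shards TGI launches, deriving it from `DSTACK_GPUS_NUM` keeps the shard count in sync with the provisioned GPUs.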

## Running a configuration

To run a configuration, use the `dstack apply` command.

<div class="termy">

```shell
$ HF_TOKEN=...
$ dstack apply -f examples/deployment/tgi/.dstack.yml

 #  BACKEND  REGION      RESOURCES                      SPOT  PRICE
 1  vastai   is-iceland  48xCPU, 128GB, 2xH200 (140GB)  no    $7.87
 2  runpod   EU-SE-1     40xCPU, 128GB, 2xH200 (140GB)  no    $7.98

Submit the run llama4-scout? [y/n]: y

Provisioning...
---> 100%
```

</div>
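Once the run is submitted, you can inspect it with the usual `dstack` CLI commands, e.g. listing runs and streaming the service logs:

<div class="termy">

```shell
$ dstack ps
$ dstack logs llama4-scout
```

</div>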

If no gateway is created, the model will be available via the OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`.

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "max_tokens": 128
    }'
```

</div>

When a gateway is configured, the OpenAI-compatible endpoint is available at `https://gateway.<gateway domain>/`.
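For example, the same request through a gateway would look like the following sketch, assuming the standard OpenAI-style `/v1/chat/completions` path under the gateway domain:

<div class="termy">

```shell
$ curl https://gateway.<gateway domain>/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "messages": [{"role": "user", "content": "What is Deep Learning?"}],
      "max_tokens": 128
    }'
```

</div>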

## Source code

The source code of this example can be found in `examples/deployment/tgi` :material-arrow-top-right-thin:{ .external }.

## What's next?

  1. Check services
  2. Browse the Llama, vLLM, SGLang, and NIM examples
  3. See also AMD and TPU