---
title: HuggingFace TGI
description: This example shows how to deploy Llama 4 Scout to any cloud or on-premises environment using HuggingFace TGI and dstack.
---

# HuggingFace TGI

This example shows how to deploy Llama 4 Scout with dstack using HuggingFace TGI :material-arrow-top-right-thin:{ .external }{:target="_blank"}.

??? info "Prerequisites"
    Once `dstack` is installed, go ahead and clone the repo, and run `dstack init`.

    <div class="termy">

    ```shell
    $ git clone https://github.com/dstackai/dstack
    $ cd dstack
    $ dstack init
    ```

    </div>
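    If `dstack` isn't installed yet, a typical local setup installs it with `pip` and starts the server (a sketch; see the dstack installation docs for other options):

    <div class="termy">

    ```shell
    $ pip install "dstack[all]" -U
    $ dstack server
    ```

    </div>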

## Deployment

Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct :material-arrow-top-right-thin:{ .external }{:target="_blank"} using TGI.

```yaml
type: service
name: llama4-scout

image: ghcr.io/huggingface/text-generation-inference:latest

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - MAX_INPUT_LENGTH=8192
  - MAX_TOTAL_TOKENS=16384
  # max_batch_prefill_tokens must be >= max_input_tokens
  - MAX_BATCH_PREFILL_TOKENS=8192
commands:
  # Activate the virtual environment at /usr/src/.venv/,
  # as required by TGI's latest image
  - . /usr/src/.venv/bin/activate
  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher

port: 80
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

# Uncomment to cache downloaded models
#volumes:
#  - /data:/data

resources:
  gpu: H200:2
  disk: 500GB..
```
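`DSTACK_GPUS_NUM` is a runtime variable that `dstack` sets to the number of GPUs allocated to the run. With the `H200:2` resources above, the command that ends up running inside the container is equivalent to this sketch:

<div class="termy">

```shell
# Inside the container, with 2 GPUs provisioned:
$ . /usr/src/.venv/bin/activate
$ NUM_SHARD=2 text-generation-launcher
```

</div>

Since `NUM_SHARD` controls how many tensor-parallel shards TGI launches, deriving it from `DSTACK_GPUS_NUM` keeps the shard count in sync with the provisioned GPUs.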

## Running a configuration

To run a configuration, use the `dstack apply` command.

<div class="termy">

```shell
$ HF_TOKEN=...
$ dstack apply -f examples/deployment/tgi/.dstack.yml

 #  BACKEND  REGION      RESOURCES                      SPOT  PRICE
 1  vastai   is-iceland  48xCPU, 128GB, 2xH200 (140GB)  no    $7.87
 2  runpod   EU-SE-1     40xCPU, 128GB, 2xH200 (140GB)  no    $7.98

Submit the run llama4-scout? [y/n]: y

Provisioning...
---> 100%
```

</div>
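Once the run is submitted, you can inspect it with the usual `dstack` CLI commands, e.g. listing runs and streaming the service logs:

<div class="termy">

```shell
$ dstack ps
$ dstack logs llama4-scout
```

</div>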

If no gateway is created, the model will be available via the OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`.

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "max_tokens": 128
    }'
```

</div>

When a gateway is configured, the OpenAI-compatible endpoint is available at `https://gateway.<gateway domain>/`.
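For example, the same request through a gateway would look like the following sketch, assuming the standard OpenAI-style `/v1/chat/completions` path under the gateway domain:

<div class="termy">

```shell
$ curl https://gateway.<gateway domain>/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "messages": [{"role": "user", "content": "What is Deep Learning?"}],
      "max_tokens": 128
    }'
```

</div>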

## Source code

The source code of this example can be found in `examples/deployment/tgi` :material-arrow-top-right-thin:{ .external }.

## What's next?

  1. Check services
  2. Browse the Llama, vLLM, SGLang, and NIM examples
  3. See also AMD and TPU