How to Containerize Your Local LLM
Learn how to create a container to expose your local large language model (LLM) for consumption by an API.
Presented by Ezequiel Lanza — Open Source AI Evangelist (Intel)
Let’s say you’re building a chatbot for your company and you need to find the best way to deploy it. You might start by building your logic in a Jupyter Notebook, but this won’t work when you want to deploy it in a live environment where users can interact with it. You might then start thinking about a suitable strategy for making your application scalable, portable, and efficient. Cloud native development emerges as the best alternative, allowing you to treat each piece of the application as an isolated module and focus on concerns like scaling and sizing.
So, what happens next? You need to create the core components of your application, starting with the large language model (LLM). You might want an LLM container that receives a string input and returns the model’s response (Image 1). That sounds easy, right?
Unfortunately, this can be challenging for any application composed of more than just a language model. For instance, you may have a React front end that doesn’t understand the intricacies of LLMs, such as LLM pipelines, AI frameworks, or optimizations needed to make your model more efficient. To address this, you need to abstract the LLM logic and provide an API for the application to interact with by containerizing your model. This approach offers several benefits characteristic of microservices architecture, such as scalability, isolation, portability, efficiency, and continuous deployment.
In this post, we’ll walk you through the step-by-step process of configuring and creating an LLM container for use by an application. This post builds on the project we shared in our article Easily Deploy Multiple LLMs in a Cloud Native Environment, so you may want to read that one first.
You’ll find the code you need for this project here: https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain/tree/main
Why a Container?
When containerizing LLM logic, you can use either a local or an external model. It’s worth mentioning that if you choose to use a local model, you don’t have to include it in the container image. This avoids having to package large model files (~26GB for a 7B model) with the container, which would significantly increase the container size and slow down deployment times. It also eliminates the need to rebuild the container image whenever the model is updated or retrained. Instead, you can store the model on a local server or file system that the containerized application can access when the container launches, allowing for more efficient management and deployment.
Since this example is based on a project, you’ll first need to clone the repo with the code.
git clone https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain.git
As mentioned in the previous post, the repo explains how to create all the containers you need to build your own multi-chatbot, such as the front end, the proxy, and the LLM models. In this case, we’ll focus on the container for the non-optimized Llama 2 model.
What Will Be Inside The Container?
This example assumes you want to use a local non-optimized model, which means you didn’t optimize your model using a tool like Intel® Extension for Transformers (ITREX) to make it use fewer resources. In short, you want to use a vanilla version of Llama 2, which you can download from Hugging Face. Save the model in an externally accessible place (ideally a file server) so the container can use it when it’s launched.
You might be wondering, why aren’t we storing the model in the container image? Let’s think about it. Models are heavy. Consider, for example, the Llama 2 model family. The Llama 2 13B model has 13 billion parameters and requires around 40 gigabytes of storage space. Baking such a large model into the container image would significantly increase the container’s size, making it more cumbersome to deploy and manage. By keeping the model on an external file server, you keep the container lightweight and agile, which facilitates easy updates and scalability. This approach also allows for better resource management, as the model can be shared across multiple containers or instances, reducing redundancy and optimizing storage usage.
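To make this concrete, here’s a minimal sketch of what that looks like from inside the container: at startup, the application simply loads the model from the path where the external storage is mounted. The MODEL_PATH environment variable below is illustrative (the repo’s code hardcodes the mount path you’ll see later in llama2.py):
import os

# Illustrative: resolve the model location from an environment variable,
# falling back to the mount path used later in llama2.py
model_path = os.getenv("MODEL_PATH", "/fs_mounted/Models/llama-2-7b-chat-hf")

# Fail early with a clear message if the external volume isn't mounted
if not os.path.isdir(model_path):
    raise RuntimeError(f"Model not found at {model_path}. Is the file server volume mounted?")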
The repo is organized by folders, and each folder corresponds to a different image in the chatbot. Go to the folder where the local model scripts are, in this case “LLAMA-non”:
cd 3__Local_Models/LLAMA-non
Within that folder, you’ll find all the files needed to build the LLM container. The following tree shows the files Docker will use to create the container.
LLAMA-non
├── Dockerfile
├── app
│   ├── __init__.py
│   ├── llama2.py
│   └── server.py
└── requirements.txt
Let’s explore each part before building the container.
Application Folder (/app)
Any application that will live in a container has to be declared. Within the application folder (/app), we typically find the scripts that make the application run. In our case, we’ll have two scripts (llama2.py and server.py): one to declare the LLM pipeline object and the other to expose the API. These scripts run when the container is launched. Let’s explore each of them.
llama2.py (/app/llama2.py)
Here is where the pipeline will live. A pipeline abstracts the complexities of the underlying models and provides an easy-to-use interface for common tasks such as text classification, question answering, or text generation.
The code below sets up a Hugging Face text generation pipeline using a locally stored Llama 2 model and its tokenizer, both downloaded ahead of time. Hosting the model locally allows further customization: when configuring the pipeline, you can set specific parameters to control the text generation behavior, like limiting the maximum number of tokens generated with max_new_tokens, penalizing the repetition of words with repetition_penalty, or defining how “creative” the model will be with temperature.
Since we’ll be using LangChain to integrate the pipeline, we wrap it in LangChain’s HuggingFacePipeline object, making it ready to be served via LangChain’s API.
from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

# Local or file server address where the local model is stored
model_path = "/fs_mounted/Models/llama-2-7b-chat-hf"
local_model = LlamaForCausalLM.from_pretrained(model_path, return_dict=True, trust_remote_code=True)
local_tokenizer = LlamaTokenizer.from_pretrained(model_path)

# Hugging Face text-generation pipeline
pipe = pipeline(task="text-generation", model=local_model, tokenizer=local_tokenizer,
                trust_remote_code=True, max_new_tokens=100,
                repetition_penalty=1.1, model_kwargs={"max_length": 1200, "temperature": 0.01})

# Pipeline to be consumed by the LangServe API
llm_pipeline = HuggingFacePipeline(pipeline=pipe)
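Before wiring this into the API, you can sanity-check the wrapped pipeline with a quick local test. This snippet is just an illustrative check that assumes the model path above is reachable; it isn’t part of the container code:
# Run from the LLAMA-non folder so the app package is importable
from app.llama2 import llm_pipeline

# HuggingFacePipeline implements LangChain's Runnable interface,
# so invoke() takes a plain string prompt and returns the generated text
print(llm_pipeline.invoke("Explain in one sentence what a container is."))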
server.py (/app/server.py)
As we mentioned, to interact with the model in our cloud native environment, we expose the LLM via an API call. External applications or front ends can simply send an API POST message and receive a response, without needing to understand the pipeline or LLM parameters.
This is where our FastAPI application comes in. By creating an instance of FastAPI, we set up a web service that can handle these API requests. We configure cross-origin resource sharing (CORS) middleware to allow requests from any origin, which is useful during the development stage. This ensures that our front end, regardless of its origin, can communicate with the backend. Note that you should adapt this configuration when it’s used in your production scenario.
Since this is an LLM server, we also define a prompt template to guide the LLM in providing clear, concise, and accurate answers. The template instructs the model to respond within a specified character limit and to admit when it doesn’t know the answer, ensuring reliability and consistency in responses.
We’ll also need to set up a root endpoint to redirect users to the API documentation page, making it easy for developers to understand and use the API. We then add specific routes to handle requests involving the LLM pipeline. These routes use the predefined prompt template and the LLM pipeline to process inputs and generate responses.
When the script is run directly (when the container launches), the application starts using Uvicorn, a fast ASGI server, and listens on localhost at port 5005. This setup allows us to expose the LLM through a simple API call, making it accessible to external applications without requiring them to handle the complexities of the LLM’s internal workings.
from fastapi import FastAPI
from langchain.prompts import PromptTemplate
from fastapi.responses import RedirectResponse
from fastapi.middleware.cors import CORSMiddleware
from langserve import add_routes
from app.llama2 import llm_pipeline

# Initialize a FastAPI app instance
app = FastAPI()

# Set up CORS middleware to allow requests from any origin
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Set this to the specific origin of your frontend in production
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Define a template for prompts sent to the LLM. The template ensures that
# the assistant gives clear, concise, and accurate answers.
template = """You are a very smart and educated assistant to guide the user to understand the concepts. Please explain the answer.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Question: {question}

Only return the helpful answer below and nothing else. Give an answer in 1000 characters at maximum please.
Helpful answer:
"""
prompt = PromptTemplate.from_template(template)

# Redirect the root endpoint to the API documentation
@app.get("/")
async def redirect_root_to_docs():
    return RedirectResponse("/docs")

# Expose the prompt + LLM pipeline chain through LangServe
add_routes(app, prompt | llm_pipeline, path="/chain_llama_non")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="localhost", port=5005)
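Once the container is running, any client can call the chain with a plain HTTP POST; LangServe exposes an /invoke endpoint under the path we registered with add_routes. Here’s a rough sketch of a client call, assuming the server is reachable on localhost:5005 and the requests package is installed on the client:
import requests

# LangServe adds an /invoke endpoint under the path passed to add_routes
url = "http://localhost:5005/chain_llama_non/invoke"

# The chain's input matches the variable defined in the prompt template ("question")
payload = {"input": {"question": "What is a large language model?"}}

response = requests.post(url, json=payload)
print(response.json()["output"])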
Create and Upload Your Container to a Registry
Now that we’ve defined the files we need, the next step is to create the container image with those files, including all the dependencies, to allow easy replication. Finally, we’ll upload the image to our preferred registry so any environment (such as a Kubernetes cluster) can access it.
Create the Container Image
The following is the file the Docker engine needs to create the container; we can think of a Dockerfile in two parts: image creation and execution.
For image creation, we need to define the steps the engine has to follow to create the image. In this stage, we instruct the engine to use a specific base image, which in our case will be Python. We’ll use the python:3.11-slim version because, as mentioned, the idea of the container is to carry minimal software. For example, we don’t need to install, say, an entire Ubuntu image, which comes with several software components that won’t be used and would make the container heavier for no reason. In addition, all the required files the application will use need to be declared. The COPY command transfers the requirements.txt file and the application directory (./app) into the container. requirements.txt lists the Python dependencies, while the application directory contains the source code. This ensures the application is packaged with its dependencies and ready for execution inside the container.
FROM python:3.11-slim
# Set the working directory in the container
WORKDIR /usr/
# Copy requirements file to install Python dependencies
COPY requirements.txt ./
# Upgrade pip and install dependencies
RUN pip install --upgrade pip && \
    pip install -r requirements.txt && \
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# Copy the app folder with the application into the container
COPY ./app ./app
# Expose port 5005; the application will serve the API on that port
EXPOSE 5005
After the image is created, we need to declare what the container will do when it’s launched. This is done with CMD, which specifies the command to execute when the container starts. In our example, we launch the Uvicorn server on port 5005.
# Command the container image will run when it is launched
CMD exec uvicorn app.server:app --host 0.0.0.0 --port 5005
Let’s build our container! In this case, we’ll build a container to run on Intel x86 architecture to take advantage of further optimizations like Intel® Extension for PyTorch. Those optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs, as well as Intel® Xe Matrix Extensions (Intel® XMX) AI engines on Intel discrete GPUs.
NOTE: Be sure you have Docker Engine installed. Refer to https://www.docker.com
docker build --platform linux/amd64 -t llama7b-non-optimized:latest .
Upload the Container
We should now see our container image locally. Next, we’ll upload it to a registry; this will be relevant when building your Kubernetes environment.
docker login
docker tag llama7b-non-optimized:latest <username>/llama7b-non-optimized:latest
docker push <username>/llama7b-non-optimized:latest
Your container image is now on Docker Hub and ready for use when you deploy your Kubernetes cluster.
Call to Action
Containerizing your model can transform how you deploy and scale your applications. Clone the repository and follow the guide to build your own Llama 2 container and make it accessible via an API: Multi-llms-Chatbot-CloudNative-LangChain GitHub repo
This step-by-step process will help make your model scalable, portable, and ready for production.
For more articles on LLM topics, be sure to also check out:
· Tabular Data, RAG, & LLMs: Improve Results Through Data Table Prompting
· Easily Deploy Multiple LLMs in a Cloud Native Environment
· Four Data Cleaning Techniques to Improve Large Language Model (LLM) Performance
· Optimize Vector Databases, Enhance RAG-Driven Generative AI
About the Author
Ezequiel Lanza, Open Source AI Evangelist, Intel
Ezequiel Lanza is an open source AI evangelist on Intel’s Open Ecosystem team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X at @eze_lanza and LinkedIn at /eze_lanza
Follow us!
Medium, Podcast, Open.intel, X, LinkedIn