Tiny-LLM Inference Engine

A lightweight LLM inference engine written in CUDA C++ with W8A16 quantized inference. Storing weights as INT8 while keeping activations in FP16 roughly halves weight VRAM relative to FP16 weights. The engine also supports incremental decoding with a KV cache and multiple sampling strategies.

Features

W8A16 Quantization: INT8 weights + FP16 activations, roughly 50% less weight VRAM than FP16 weights, in-kernel dequantization
Efficient CUDA Kernels: shared memory tiling and warp shuffle reduction for matmul, attention, and RMSNorm
KV Cache Management: pre-allocated GPU memory pool, incremental decoding, multi-sequence support
Multiple Sampling Strategies: greedy, temperature, top-k, top-p (nucleus)
Modular Design: clean separation of kernels, transformer layers, model loading, and generation
Engineering Quality: CI pipeline, clang-format, RAII memory management, Result<T> error handling
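
As a rough illustration of the W8A16 idea, the CPU sketch below quantizes weights to INT8 with one scale per output row and dequantizes inside the dot product. It is not the engine's CUDA kernel: `QuantizedMatrix`, `quantize_rows`, and `matvec_w8` are hypothetical names, symmetric per-row scaling is an assumption about the scheme, and `float` stands in for FP16 activations because standard C++ has no half type.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: an INT8 matrix plus one float scale per output row.
struct QuantizedMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::int8_t> data;  // row-major, rows * cols entries
    std::vector<float> scales;      // one scale per row
};

// Symmetric per-row quantization (assumed scheme):
// scale = max|w| / 127, q = round(w / scale), clamped to [-127, 127].
QuantizedMatrix quantize_rows(const std::vector<float>& w, std::size_t rows, std::size_t cols) {
    assert(w.size() == rows * cols);
    QuantizedMatrix q{rows, cols, std::vector<std::int8_t>(rows * cols), std::vector<float>(rows)};
    for (std::size_t r = 0; r < rows; ++r) {
        float max_abs = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            max_abs = std::max(max_abs, std::fabs(w[r * cols + c]));
        }
        const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
        q.scales[r] = scale;
        for (std::size_t c = 0; c < cols; ++c) {
            const long v = std::lround(w[r * cols + c] / scale);
            q.data[r * cols + c] = static_cast<std::int8_t>(std::clamp(v, -127L, 127L));
        }
    }
    return q;
}

// y = W * x. The INT8 weights are widened on the fly and the row scale is
// applied once per output element, so no dequantized copy of W is stored.
std::vector<float> matvec_w8(const QuantizedMatrix& q, const std::vector<float>& x) {
    assert(x.size() == q.cols);
    std::vector<float> y(q.rows, 0.0f);
    for (std::size_t r = 0; r < q.rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < q.cols; ++c) {
            acc += static_cast<float>(q.data[r * q.cols + c]) * x[c];
        }
        y[r] = acc * q.scales[r];
    }
    return y;
}
```

Each weight occupies one byte instead of two, plus one scale per row, which is where the roughly 50% saving over FP16 weights comes from.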

Requirements

CUDA Toolkit 11.0+, CMake 3.18+, a C++17 compiler, and an NVIDIA GPU with compute capability 7.0+ (Volta through Hopper)

Build & Run

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Run tests
ctest --output-on-failure

Usage

#include <iostream>
#include <utility>

#include "tiny_llm/inference_engine.h"
using namespace tiny_llm;

int main() {
    // Model hyperparameters.
    ModelConfig config;
    config.vocab_size = 32000;
    config.hidden_dim = 4096;
    config.num_layers = 32;

    // load() returns a Result; check for an error before taking the value.
    auto result = InferenceEngine::load("model.bin", config);
    if (result.isErr()) {
        std::cerr << "Error: " << result.error() << std::endl;
        return 1;
    }
    auto engine = std::move(result.value());

    // Generation settings.
    GenerationConfig gen;
    gen.max_new_tokens = 100;
    gen.temperature = 0.7f;
    gen.do_sample = true;

    // Prompt token IDs for "Hello,".
    auto output = engine->generate({1, 15043, 29892}, gen);
    return 0;
}
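
The `temperature` and `do_sample` fields above select how the next token is drawn from the logits. The sketch below shows one common way to combine greedy, temperature, top-k, and top-p sampling on the CPU. The function names are hypothetical, the random draw `u` is passed in so the result is reproducible, the repetition penalty is left out, and nothing here is taken from the engine's own sampler.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Softmax of logits / temperature, with the max subtracted for numerical stability.
std::vector<float> softmax_with_temperature(const std::vector<float>& logits, float temperature) {
    assert(!logits.empty() && temperature > 0.0f);
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    return probs;
}

// Greedy decoding: return the index of the largest logit.
std::size_t sample_greedy(const std::vector<float>& logits) {
    assert(!logits.empty());
    return static_cast<std::size_t>(std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// Top-k followed by top-p (nucleus) filtering, then one draw.
// `u` is a uniform random number in [0, 1) supplied by the caller.
std::size_t sample_top_k_top_p(const std::vector<float>& logits, float temperature,
                               std::size_t top_k, float top_p, float u) {
    assert(top_k >= 1 && top_p > 0.0f && top_p <= 1.0f && u >= 0.0f && u < 1.0f);
    const std::vector<float> probs = softmax_with_temperature(logits, temperature);

    // Token indices sorted by descending probability.
    std::vector<std::size_t> order(probs.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&probs](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });

    // Keep at most top_k candidates, then the shortest prefix whose mass reaches top_p.
    std::size_t keep = std::min(top_k, order.size());
    float mass = 0.0f;
    for (std::size_t i = 0; i < keep; ++i) {
        mass += probs[order[i]];
        if (mass >= top_p) {
            keep = i + 1;
            break;
        }
    }

    // Draw from the kept candidates, renormalized by their total mass.
    const float target = u * mass;
    float cumulative = 0.0f;
    for (std::size_t i = 0; i < keep; ++i) {
        cumulative += probs[order[i]];
        if (target < cumulative) return order[i];
    }
    return order[keep - 1];  // reached only through floating-point rounding
}
```

In a real generation loop `u` would come from a random number generator such as `std::uniform_real_distribution<float>`. A lower temperature concentrates probability on the highest logits, and `top_k = 1` reduces to greedy decoding.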

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    InferenceEngine                          │
│  ┌───────────┐  ┌───────────────┐  ┌──────────────────────┐ │
│  │  Model    │  │  Transformer  │  │  Generation          │ │
│  │  Loader   │──▶  Layers       │──▶  (Sampling + Decode) │ │
│  └───────────┘  └───────┬───────┘  └──────────────────────┘ │
│                         │                                    │
│  ┌───────────┐  ┌───────▼───────┐  ┌──────────────────────┐ │
│  │  Stream   │  │  KV Cache     │  │  Result<T>           │ │
│  │  Pool     │  │  Manager      │  │  Error Handling      │ │
│  └───────────┘  └───────────────┘  └──────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                     CUDA Kernels                            │
│  ┌──────────────┐  ┌───────────┐  ┌────────────────────┐    │
│  │ W8A16 MatMul │  │ Attention │  │ RMSNorm            │    │
│  │ (tiling +    │  │ (KV Cache │  │ (warp shuffle      │    │
│  │  dequant)    │  │  + mask)  │  │  reduction)        │    │
│  └──────────────┘  └───────────┘  └────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

Core Components

W8A16 MatMul: INT8 weight × FP16 activation with dequantization fused into the kernel (shared memory tiling)
Attention: prefill mode (multi-token, causal mask) and decode mode (single-token, reads the KV cache)
RMSNorm: warp shuffle reduction, numerically stable
KV Cache: pre-allocated GPU memory pool with multi-sequence support; per-layer append is stateless, and the sequence length advances only through an explicit advanceSeqLen call
Sampling: greedy, temperature, top-k, and top-p strategies with a configurable repetition penalty
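
One reading of the stateless append plus explicit advanceSeqLen design is that every layer in a forward pass writes its keys and values at the same positions, and the shared sequence length moves forward once per step. The host-side sketch below models that bookkeeping for a single sequence. `ToyKVCache` and its members are hypothetical and use `std::vector` instead of GPU memory; the real `KVCacheManager` interface may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-sequence KV cache kept in host memory for illustration.
class ToyKVCache {
public:
    ToyKVCache(std::size_t num_layers, std::size_t max_seq_len, std::size_t head_dim)
        : num_layers_(num_layers),
          max_seq_len_(max_seq_len),
          head_dim_(head_dim),
          keys_(num_layers * max_seq_len * head_dim, 0.0f),
          values_(num_layers * max_seq_len * head_dim, 0.0f) {}

    // Write K/V rows for `num_tokens` new tokens of one layer, starting at the
    // current sequence length. The sequence length itself is left unchanged,
    // so every layer in the same forward pass writes to the same positions.
    void append(std::size_t layer, const std::vector<float>& k,
                const std::vector<float>& v, std::size_t num_tokens) {
        assert(layer < num_layers_);
        assert(seq_len_ + num_tokens <= max_seq_len_);
        assert(k.size() == num_tokens * head_dim_ && v.size() == k.size());
        const auto offset = static_cast<std::ptrdiff_t>(index(layer, seq_len_, 0));
        std::copy(k.begin(), k.end(), keys_.begin() + offset);
        std::copy(v.begin(), v.end(), values_.begin() + offset);
    }

    // Advance the shared sequence length after all layers have appended.
    void advanceSeqLen(std::size_t num_tokens) {
        assert(seq_len_ + num_tokens <= max_seq_len_);
        seq_len_ += num_tokens;
    }

    std::size_t seqLen() const { return seq_len_; }

    float keyAt(std::size_t layer, std::size_t pos, std::size_t d) const {
        return keys_[index(layer, pos, d)];
    }

private:
    // Layout: [layer][position][head_dim], flattened row-major.
    std::size_t index(std::size_t layer, std::size_t pos, std::size_t d) const {
        return (layer * max_seq_len_ + pos) * head_dim_ + d;
    }

    std::size_t num_layers_;
    std::size_t max_seq_len_;
    std::size_t head_dim_;
    std::size_t seq_len_ = 0;
    std::vector<float> keys_;
    std::vector<float> values_;
};
```

Under this reading, a prefill of N tokens calls `append` once per layer and then `advanceSeqLen(N)`, while each decode step does the same with one token. Keeping the length update separate means no layer needs to know whether it is the last one in the pass.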

Project Structure

tiny-llm/
├── include/tiny_llm/          # Public headers
│   ├── types.h                # ModelConfig, GenerationConfig, QuantizedWeight
│   ├── result.h               # Result<T> error handling (Rust-style)
│   ├── cuda_utils.h           # CUDA_CHECK, DeviceBuffer<T> RAII
│   ├── cuda_streams.h         # StreamPool
│   ├── kv_cache.h             # KVCacheManager
│   ├── model_loader.h         # Model loading
│   ├── transformer.h          # TransformerLayer
│   └── inference_engine.h     # InferenceEngine
├── kernels/                   # CUDA kernels
│   ├── w8a16_matmul.cu/.cuh   # W8A16 quantized matmul (tiling + fused dequant)
│   ├── attention.cu/.cuh      # Attention (prefill + decode, KV cache)
│   ├── rmsnorm.cu/.cuh        # RMSNorm (warp shuffle reduction)
│   ├── elementwise.cu/.cuh    # Elementwise ops (SiLU, residual add)
│   └── warp_utils.cuh         # Warp-level primitives
├── src/                       # Host source files
│   ├── inference_engine.cpp   # Engine main logic
│   ├── transformer.cpp        # Transformer forward pass
│   ├── kv_cache.cpp           # KV cache alloc / append / reclaim
│   ├── model_loader.cpp       # Model file loading
│   └── main.cpp               # Demo entry point
├── tests/                     # Google Test
└── CMakeLists.txt             # CMake build (v2.0.0, FetchContent GTest)

GPU Support

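| Architecture | Compute Capability | Example GPUs   |
|--------------|--------------------|----------------|
| Volta        | SM 7.0             | V100           |
| Turing       | SM 7.5             | RTX 2080, T4   |
| Ampere       | SM 8.0 / 8.6       | A100, RTX 3090 |
| Ada Lovelace | SM 8.9             | RTX 4090, L40  |
| Hopper       | SM 9.0             | H100           |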

Testing

./tiny_llm_tests --gtest_filter="W8A16*"       # Quantized matmul
./tiny_llm_tests --gtest_filter="Attention*"    # Attention mechanism
./tiny_llm_tests --gtest_filter="KVCache*"      # Cache management
./tiny_llm_tests --gtest_filter="Integration*"  # End-to-end

| Test Suite   | Coverage                                                   |
|--------------|------------------------------------------------------------|
| W8A16 MatMul | Quantization accuracy, tiling correctness, boundary sizes  |
| Attention    | Masked self-attention, KV cache append, prefill/decode     |
| RMSNorm      | Normalization invariants, numerical stability              |
| KV Cache     | Allocation, append, multi-sequence, advanceSeqLen          |
| Transformer  | Layer forward pass, weight loading                         |
| Integration  | End-to-end prompt → generation                             |
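
To make the RMSNorm row concrete, the sketch below is a CPU reference of the kind a kernel test could compare against, together with the invariant it implies: with unit gains the output has a root mean square close to 1, and scaling the input by a positive constant leaves the output nearly unchanged. The function names and the default `eps = 1e-5f` are assumptions, not values taken from the test suite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * gamma[i].
// The sum of squares is accumulated in double so the reference is more
// accurate than a reduced-precision kernel it would be compared against.
std::vector<float> rmsnorm_reference(const std::vector<float>& x,
                                     const std::vector<float>& gamma,
                                     float eps = 1e-5f) {
    assert(!x.empty() && x.size() == gamma.size());
    double sum_sq = 0.0;
    for (float v : x) sum_sq += static_cast<double>(v) * v;
    const double inv_rms = 1.0 / std::sqrt(sum_sq / static_cast<double>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = static_cast<float>(x[i] * inv_rms * gamma[i]);
    }
    return y;
}

// Root mean square of a vector, used to state the invariant.
double root_mean_square(const std::vector<float>& v) {
    assert(!v.empty());
    double sum_sq = 0.0;
    for (float e : v) sum_sq += static_cast<double>(e) * e;
    return std::sqrt(sum_sq / static_cast<double>(v.size()));
}
```

A GPU test would typically compare kernel output to this reference within a tolerance suited to FP16 rather than expecting exact equality.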

License

MIT License
