SGEMM Optimization: From Naive to Tensor Core

Hand-written, progressively optimized CUDA matrix multiplication, the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive one-thread-per-output-element kernel to a Tensor Core WMMA kernel that reaches about 40% of cuBLAS throughput on the benchmark below.

Performance (RTX 3060 Laptop, 1024×1024×1024)

Kernel	GFLOPS	vs cuBLAS	Time	Key Technique
cuBLAS (ref)	5727	100%	0.375 ms	NVIDIA optimized library
Tensor Core (WMMA)	2300	40.2%	0.934 ms	FP16→FP32 mixed precision
Tiled (32×32)	753	13.1%	2.853 ms	Shared memory blocking
Double Buffer	701	12.2%	3.064 ms	Compute-memory overlap
Bank Conflict Free	673	11.8%	3.190 ms	Shared memory padding (+1)
Naive	604	10.6%	3.553 ms	One thread per output element
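
The GFLOPS column can be reproduced from the Time column. The sketch below assumes the usual GEMM operation count of 2·M·N·K (one multiply and one add per inner-product term); the helper names `gemm_flops` and `gemm_gflops` are illustrative and are not taken from the repository.

```cuda
#include <cassert>

// FLOP count for C = A * B with A (M x K) and B (K x N):
// each of the M*N outputs needs K multiplies and K adds.
inline double gemm_flops(long long M, long long N, long long K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
           static_cast<double>(K);
}

// Convert a measured kernel time in milliseconds to GFLOPS.
inline double gemm_gflops(long long M, long long N, long long K,
                          double time_ms) {
    assert(time_ms > 0.0);
    const double seconds = time_ms * 1e-3;
    return gemm_flops(M, N, K) / seconds / 1e9;
}
```

For example, `gemm_gflops(1024, 1024, 1024, 0.375)` gives roughly 5727, which matches the cuBLAS row.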

All kernels are verified against cuBLAS with an allclose check (rtol=1e-3, atol=1e-4; the Tensor Core kernel uses a relaxed rtol=5e-2).
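
A minimal host-side sketch of such a check follows. It assumes the NumPy-style definition `|out - ref| <= atol + rtol * |ref|`; the README does not say which formula `verify.cuh` uses, so treat both the formula and the function name as assumptions.

```cuda
#include <cassert>
#include <cmath>
#include <cstddef>

// Returns true when every element satisfies
//   |out[i] - ref[i]| <= atol + rtol * |ref[i]|   (assumed NumPy-style rule).
// Defaults mirror the FP32 tolerances quoted above.
inline bool allclose(const float* out, const float* ref, std::size_t n,
                     float rtol = 1e-3f, float atol = 1e-4f) {
    assert(out != nullptr && ref != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        const float diff = std::fabs(out[i] - ref[i]);
        // Written as !(a <= b) so that a NaN difference also fails.
        if (!(diff <= atol + rtol * std::fabs(ref[i]))) return false;
    }
    return true;
}
```

Under this definition, the Tensor Core check would be `allclose(out, ref, n, 5e-2f)`, with the absolute tolerance left at its default because the README only states the relaxed rtol.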

Optimization Roadmap

  ┌─────────┐     ┌──────────┐     ┌──────────────┐     ┌───────────────┐
  │  Naive  │────▶│  Tiled   │────▶│  Bank-Free   │────▶│ Double Buffer │
  │ 604 GF  │     │ 753 GF   │     │   673 GF     │     │   701 GF      │
  └─────────┘     └──────────┘     └──────────────┘     └───────┬───────┘
                                                                │
                                                                ▼
                                                    ┌───────────────────┐
                                                    │   Tensor Core     │
                                                    │   2300 GF (WMMA)  │
                                                    └───────────────────┘

Stage	What Changes	Why It Helps
Naive → Tiled	Load tiles into shared memory	Data reuse reduces global memory traffic by TILE_SIZE×
Tiled → Bank-Free	Pad shared memory `[32][33]`	Eliminates 32-way bank conflicts on column access
Bank-Free → Double Buffer	Two shared-memory buffers	Overlaps next-tile load with current-tile compute
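Double Buffer → Tensor Core	WMMA API `mma_sync`	Dedicated matrix units, ~8× peak over CUDA cores

In the measurements above, the bank-free (673 GFLOPS) and double-buffer (701 GFLOPS) kernels run slower than the plain tiled kernel (753 GFLOPS) at this problem size, so the stages are ordered by technique rather than by measured speedup.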
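
The first two stages and the last stage of the table can be sketched as follows. This is an illustration under stated assumptions and not the code in `src/kernels/`: matrices are row-major, alpha is 1 and beta is 0, the WMMA kernel requires dimensions that are multiples of 16, and all kernel and launcher names are made up for the example. Double buffering is left out to keep the sketch short.

```cuda
#include <cassert>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

constexpr int TILE = 32;  // matches the 32x32 tile size in the table

// Tiled SGEMM sketch: C (M x N) = A (M x K) * B (K x N), all row-major.
// Each block stages one TILE x TILE tile of A and of B in shared memory,
// so a global element is loaded once per block instead of once per thread.
__global__ void tiled_sgemm_sketch(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
    // The extra column is the "+1" padding: it shifts each row by one bank,
    // so threads that read down a column would hit 32 different banks.
    __shared__ float As[TILE][TILE + 1];
    __shared__ float Bs[TILE][TILE + 1];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        const int a_col = t + threadIdx.x;
        const int b_row = t + threadIdx.y;
        // Out-of-range elements are loaded as zero so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before any thread reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the next overwrite
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

namespace wmma = nvcuda::wmma;

// Tensor Core sketch: each block is a single warp that owns one 16x16 tile
// of C. A and B are FP16, accumulation and C are FP32. Assumes M, N and K
// are multiples of 16; needs sm_70 or newer.
__global__ void wmma_gemm_sketch(const half* A, const half* B, float* C,
                                 int M, int N, int K) {
    const int tile_m = blockIdx.y, tile_n = blockIdx.x;
    if (tile_m * 16 >= M || tile_n * 16 >= N) return;  // uniform per warp

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + tile_m * 16 * K + k, K);  // 16x16 of A
        wmma::load_matrix_sync(b, B + k * N + tile_n * 16, N);  // 16x16 of B
        wmma::mma_sync(acc, a, b, acc);                         // acc += a * b
    }
    wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, acc, N,
                            wmma::mem_row_major);
}

// Hypothetical launchers; pointers must refer to device-accessible memory.
void launch_tiled(const float* dA, const float* dB, float* dC,
                  int M, int N, int K) {
    assert(M > 0 && N > 0 && K > 0);
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiled_sgemm_sketch<<<grid, block>>>(dA, dB, dC, M, N, K);
}

void launch_wmma(const half* dA, const half* dB, float* dC,
                 int M, int N, int K) {
    assert(M % 16 == 0 && N % 16 == 0 && K % 16 == 0);
    dim3 grid(N / 16, M / 16);
    wmma_gemm_sketch<<<grid, 32>>>(dA, dB, dC, M, N, K);  // one warp per block
}
```

In this particular tiled sketch, a warp reads one row of `Bs` and broadcasts a single element of `As`, so the padding only pays off in variants that index a tile column-wise (for example after a transposed load). The WMMA sketch uses one warp per block for clarity; a tuned kernel would give each block several warps and stage tiles through shared memory.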

Build & Run

```bash
# Makefile (adjust GPU arch for your hardware)
make GPU_ARCH=sm_86
make benchmark

# Or CMake
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark
```

Project Structure

sgemm-optimization/
├── src/
│   ├── kernels/
│   │   ├── naive_sgemm.cuh              # Naive: basic triple loop
│   │   ├── tiled_sgemm.cuh              # Tiled: shared memory blocking
│   │   ├── bank_conflict_free_sgemm.cuh # Bank conflict elimination
│   │   ├── double_buffer_sgemm.cuh      # Double buffer pipeline
│   │   └── tensor_core_sgemm.cuh        # Tensor Core (WMMA API)
│   ├── utils/
│   │   ├── cuda_utils.cuh               # CUDA error checking & utilities
│   │   ├── benchmark.cuh                # Benchmark framework (CUDA Events)
│   │   └── verify.cuh                   # Correctness verification (vs cuBLAS)
│   └── main.cu                          # Entry point
├── tests/
│   └── test_sgemm.cu                    # Google Test property tests
├── roofline_data_*.csv                  # Roofline analysis data
├── CMakeLists.txt                       # CMake build (recommended)
└── Makefile                             # Make build (quick start)
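
The tree describes `benchmark.cuh` as a CUDA Events based framework. The general timing pattern looks roughly like the sketch below; the function name, the warm-up count and the iteration count are assumptions for illustration, and error checking (which the repository keeps in `cuda_utils.cuh`) is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Average time in milliseconds of one `launch()` call, measured with CUDA
// events over `iters` runs after `warmup` untimed runs.
template <typename Launch>
float time_kernel_ms(Launch launch, int warmup = 3, int iters = 20) {
    assert(iters > 0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < warmup; ++i) launch();  // exclude one-time setup costs

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed work has finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / static_cast<float>(iters);
}
```

A call would pass a lambda that launches the kernel under test, for example `time_kernel_ms([&] { my_kernel<<<grid, block>>>(dA, dB, dC, M, N, K); })`, where `my_kernel` and its arguments are placeholders.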

Testing

Property-based tests with Google Test:

Property	What It Verifies
Numerical correctness	All kernels match cuBLAS output (allclose)
Tensor Core tolerance	Correct under relaxed FP16 tolerance
Error detection	Verification system catches injected errors
Dimension invariance	All kernels handle arbitrary aligned sizes

```bash
make test
# Or: cmake --build build --target test_sgemm && ctest --test-dir build
```

GPU Architecture Reference

GPU Family	Architecture	Compute Capability	Build Flag
Tesla V100	Volta	sm_70	`GPU_ARCH=sm_70`
RTX 2080	Turing	sm_75	`GPU_ARCH=sm_75`
A100	Ampere	sm_80	`GPU_ARCH=sm_80`
RTX 3090	Ampere	sm_86	`GPU_ARCH=sm_86`
RTX 4090 / L40	Ada Lovelace	sm_89	`GPU_ARCH=sm_89`
H100	Hopper	sm_90	`GPU_ARCH=sm_90`
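
If the GPU is not listed, the matching flag can be read from the device at run time. The sketch below uses `cudaGetDeviceProperties` from the CUDA runtime; the helper name `print_gpu_arch` is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the GPU_ARCH value for the given device, e.g. "sm_86" for compute
// capability 8.6. Returns major * 10 + minor, or -1 if the query fails.
int print_gpu_arch(int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    std::printf("%s: GPU_ARCH=sm_%d%d\n", prop.name, prop.major, prop.minor);
    return prop.major * 10 + prop.minor;
}
```

The Tensor Core kernel relies on the WMMA API, which needs compute capability 7.0 or higher, so every row in the table qualifies.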

Engineering Quality

Build: CMake 3.18+ with target_include_directories, target_compile_options (generator expressions), FetchContent for GTest v1.14.0
Code style: clang-format enforced via CI
CI: GitHub Actions — CUDA container build + format check
Testing: Google Test property-based verification against cuBLAS

References

CUDA C++ Programming Guide
How to Optimize a CUDA Matmul Kernel — Simon Boehm
CUTLASS — NVIDIA's high-performance GEMM library
cuBLAS Documentation
Roofline Model

License

MIT License
