
Portable LLMs with llamafile

By Daroc Alden
May 14, 2024

Large language models (LLMs) have been the subject of much discussion and scrutiny recently. Of particular interest to open-source enthusiasts are the problems with running LLMs on one's own hardware — especially when doing so requires NVIDIA's proprietary CUDA toolkit, which remains unavailable in many environments. Mozilla has developed llamafile as a potential solution to these problems. Llamafile can turn LLM weights into portable, native executables for easy integration, archival, or distribution. These executables can take advantage of supported GPUs when present, but do not require them.

Portability

Llamafile is based on the MIT-licensed llama.cpp library, a C/C++ implementation of the code necessary to evaluate LLMs of several different architectures. Llamafile maintains its own copy of llama.cpp, with some additional MIT-licensed changes. The code for llamafile itself is made available under the Apache 2.0 license. Llamafile's value comes from building llama.cpp in a way that works seamlessly across many different environments. On April 25, Mozilla posted an update about llamafile's progress that called it "the easiest and fastest way to run a wide range of open large language models".

The lead developer of llamafile, Justine Tunney, chose to re-use her previous work on Cosmopolitan Libc. That project implements a C standard library and associated compiler infrastructure to compile programs written in C as multi-format executables that can run on Linux, macOS, Windows, several BSDs, or without an operating system.

In the blog post announcing the start of Cosmopolitan Libc, Tunney said:

I like the idea of having the freedom to write software without restrictions that transcends traditional boundaries. My goal has been helping C become a build-once run-anywhere language, suitable for greenfield development, while avoiding any assumptions that would prevent software from being shared between tech communities.

In service of that goal, Cosmopolitan Libc uses clever techniques to create executable files that can be simultaneously interpreted as several different formats. The executables it produces start with a shell script that behaves differently on different operating systems, which the program exploits to select a pre-compiled static binary suitable for the operating system and architecture at run time. These files (along with the weights of the LLM, in llamafile's case) are bundled into a zip file. Since zip files store their metadata at the end of the file, the same file can serve as shell script, executable, and zip archive.
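
The same end-of-file trick is easy to demonstrate outside of llamafile. The following Python sketch is an illustration only, not llamafile's actual build process, and uses a trivial stub in place of Cosmopolitan's real launcher: it appends a zip archive to an arbitrary prefix and shows that the standard zipfile module still reads it, because zip readers locate the archive's central directory by scanning from the end of the file.

    import io
    import zipfile

    # Stand-in for the polyglot launcher; the real Cosmopolitan "APE"
    # prefix is far more elaborate than this stub.
    prefix = b"#!/bin/sh\necho 'launcher stub'\nexit 0\n"

    # Build a zip archive in memory and glue it onto the prefix.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("weights.gguf", b"model weights would go here")

    with open("polyglot.bin", "wb") as out:
        out.write(prefix + buf.getvalue())

    # Zip readers find the central directory by scanning backward from the
    # end of the file, so the prepended script does not confuse them.
    with zipfile.ZipFile("polyglot.bin") as zf:
        print(zf.namelist())   # ['weights.gguf']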

This approach is of questionable use for most programs, but for Mozilla, it represents an important way to democratize access to LLMs. There are an increasing number of repositories, most prominently HuggingFace, that distribute raw LLM weights — but raw weights aren't actually enough to use the models. Users also rely on inference code and software that provides an API to actually access the results of the model. To make things worse, that inference code is often specific to a particular brand of GPU or machine-learning toolchain, which makes LLMs hard to run except in specific environments.

Llamafile applies Cosmopolitan Libc's philosophy of selecting an appropriate implementation at run time to machine learning. It automatically detects whether the user has AMD's or NVIDIA's GPU toolchain available and, if so, uses it. If not, it uses a new open-source linear-algebra library called tinyBLAS. The library relies on the APIs exposed by existing graphics drivers to take advantage of GPU acceleration without requiring an installed toolchain. This is slower than letting NVIDIA's CUDA or AMD's ROCm compile a native program for a specific model of graphics card, but it is still useful for users who have the hardware without the GPU SDKs.

TinyBLAS doesn't work with all drivers, however. If no GPU acceleration is available, llamafile falls back to CPU implementations of the core linear algebra libraries — versions that are specialized for particular microarchitectures and specific hardware. On March 31, Tunney published a detailed blog post discussing how she improved CPU inference performance across a wide variety of hardware, often by hand-writing a matrix multiplication kernel tuned for that exact hardware.

There's another trick that llamafile uses to speed up matrix multiplication, though, which is much more specific to its purpose as a platform for running LLMs. Generic linear algebra libraries like BLAS need to be able to multiply arbitrary matrices with unknown dimensions, possibly transposed or weighted in some way. LLM inference, because it proceeds one token at a time, spends a lot of time doing matrix-vector multiplications that can be written in a simpler form.
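
To make the distinction concrete, here is a small NumPy sketch with made-up dimensions (it is not llamafile's code): processing a whole prompt at once amounts to a matrix-matrix product, while generating one token at a time reduces each step to a matrix-vector product against the same weights.

    import numpy as np

    d_out, d_model = 512, 512      # hypothetical layer dimensions
    W = np.random.rand(d_out, d_model).astype(np.float32)   # one weight matrix

    # Prompt processing: all of the prompt's tokens are known up front, so
    # their activations can be stacked and multiplied in a single pass.
    prompt_len = 64
    prompt_activations = np.random.rand(d_model, prompt_len).astype(np.float32)
    prompt_out = W @ prompt_activations    # matrix-matrix: shape (512, 64)

    # Generation: tokens arrive one at a time, so each step only needs a
    # matrix-vector product, a simpler (and narrower) operation.
    token_activation = np.random.rand(d_model).astype(np.float32)
    token_out = W @ token_activation       # matrix-vector: shape (512,)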

Even when LLMs do generalized matrix multiplications (during initialization), the models are architected such that the matrices are usually of a known size — often a multiple of 64. This lets a hand-unrolled implementation specific to those sizes outperform a more generic algorithm. Tunney benchmarked the multiplication of a 513x512 matrix with a 512x512 one (a size llamafile uses frequently), finding that her code outperformed Intel's proprietary Math Kernel Library (MKL) — on that specific size. The MKL is still faster on other sizes. Since llamafile controls the size of the batches used during LLM initialization, however, that's still a clear performance improvement.

Using llamafile

Using an LLM packaged by llamafile is fairly straightforward. The project's README links to several examples of different sizes. Downloading a file, marking it as executable, and running it is all that should be required in the vast majority of cases. Users who have binfmt_misc registrations for WINE might need to add a more specific rule to prevent WINE from being used as the program's interpreter. Running the program with no arguments will open llama.cpp's simple chat interface.

The built-in web server also offers an OpenAI-compatible API, so tools that expect to talk to the proprietary service can be seamlessly redirected, as can tools that use OpenAI's API design as a de facto standard for LLM inference. Users who are more comfortable on the command line can pass parameters and instructions as arguments instead.
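
As a concrete example, the following Python sketch queries a running llamafile with nothing but the standard library. It assumes the server is listening on its default address (http://localhost:8080) and that the OpenAI-style /v1/chat/completions endpoint is available, as it is in llama.cpp's server; adjust both if the server was started with different options.

    import json
    import urllib.request

    url = "http://localhost:8080/v1/chat/completions"
    request_body = {
        # The model name is a placeholder; the local server serves whatever
        # weights it was started with.
        "model": "local",
        "messages": [
            {"role": "user",
             "content": "Summarize what a zip archive is in one sentence."},
        ],
    }

    req = urllib.request.Request(
        url,
        data=json.dumps(request_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as response:
        reply = json.load(response)

    print(reply["choices"][0]["message"]["content"])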

Parameters can also be baked into a llamafile executable. As mentioned above, the files are actually valid zip files; adding a file named .args to the archive makes the program treat its contents as additional command-line arguments. The procedure for turning the llamafile binary produced by building the project into an LLM-specific llamafile for distribution is actually the same: add the weights and any required arguments to the zip file.

For performance reasons, however, it's important to add the weights without compression, and ideally aligned to a 4K boundary. This allows llamafile to map the weights directly into memory, which is substantially faster than decompressing them into non-disk-backed memory. For this purpose, the project also provides a utility called zipalign that adds files to a zip archive in the correct way.
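
As an illustration of the packaging step (and only that), the Python sketch below appends a hypothetical .args file and a weights file to a llamafile binary as uncompressed zip entries. The file names and argument list are invented for the example, the exact .args format is documented in the project's README, and this naive approach does not produce the 4K alignment described above; real llamafiles should be assembled with zipalign instead.

    import zipfile

    # Hypothetical names; the real weights file and arguments depend on the model.
    LLAMAFILE = "mymodel.llamafile"   # a copy of the plain llamafile binary
    WEIGHTS = "mymodel.gguf"

    # Default arguments, one per line, stored in the special .args entry.
    args_text = "-m\nmymodel.gguf\n"

    # Append entries to the existing executable. ZIP_STORED keeps the weights
    # uncompressed so they can be memory-mapped, but unlike zipalign this does
    # nothing to align them to a 4K boundary.
    with zipfile.ZipFile(LLAMAFILE, "a", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(".args", args_text)
        zf.write(WEIGHTS)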

On my laptop, which lacks any relevant GPUs but does have a spiffy 12th-generation Intel i7 processor, the Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile download provided as an example evaluates prompts at a rate of about 16 tokens per second. The answer itself is generated at about 3.5 tokens per second. The difference is attributable to the fact that during prompt evaluation, the model can use matrix-matrix multiplications, instead of matrix-vector multiplications. But this level of performance — while perhaps too slow to process many large documents — seems entirely adequate for local use.

With LLMs becoming increasingly integrated into other software, efforts to make them easy to run on existing, consumer hardware are an important part of making sure users can benefit from the technology without sending their data to third parties. The ultimate test of whether a format is suitable for widespread use is whether it is actually adopted; with llamafile being only a few months old, it's too soon to say for sure whether the project has achieved its goals. It does, however, seem to be well on the way.




Portable LLMs with llamafile

Posted May 14, 2024 17:40 UTC (Tue) by snajpa (subscriber, #73467) [Link] (7 responses)

I don't get this need to fork llama.cpp just so that people can claim they have something of their own. This should have been upstreamed, just as pretty much every other llama.cpp fork should have been. But most never try, and fork right away instead. I fail to see how that approach is useful to anyone.

Portable LLMs with llamafile

Posted May 14, 2024 17:46 UTC (Tue) by snajpa (subscriber, #73467) [Link] (5 responses)

I mean, llama.cpp is still a pretty young project where the codebase changes rapidly and these kinds of changes are what defines how the codebase is going to evolve - trying to spill that evolution out of the central pot just makes way more mess than necessary. If one takes a look at how the forks are doing - they all eventually fall way behind. It's a shame.

Portable LLMs with llamafile

Posted May 15, 2024 7:07 UTC (Wed) by rsidd (subscriber, #2582) [Link] (4 responses)

She is sending patches upstream. The whole point of open source is you can fork it for your own interests, which may not match upstream. But if upstream is interested, the fork can re-merge.

This project, as I understand it, is basically about (a) building llama.cpp with cosmocc (for improved portability: you can literally run the same output file on linux, windows, macos); (b) speeding up linear algebra (which she is feeding upstream where relevant) with the goal of (c) bundling LLM weights and llama.cpp into single, portable, fast executables that can use GPU when possible, or run with adequate speed on CPU, without user intervention or configuration.

Portable LLMs with llamafile

Posted May 15, 2024 9:55 UTC (Wed) by snajpa (subscriber, #73467) [Link] (3 responses)

I don't find it a normal practice at all. I have never seen people fork a project, spend an effort to come up with a unique name, make the effort to claim it's their project, only to be able to contribute to the original project. People just open a PR for that. But no, llama.cpp world seems special, everyone needs to claim _they_ invented a piece of this AI thing, like it's the last thing anyone will ever invent. IMHO an explanation like yours doesn't work (here).

Portable LLMs with llamafile

Posted May 15, 2024 9:59 UTC (Wed) by snajpa (subscriber, #73467) [Link] (1 responses)

Btw @ single exec file, this has been a done thing now for a while also. There's ollama, which also packs it all (and if it doesn't support all three platforms yet, that's certainly their goal), but unlike _this_ llama.cpp fork, that project has real added value. It really makes running LLMs easy for people; it abstracts llama.cpp's rough edges to present a smooth workflow for people who never touched any of this stuff. It's also original code, which *includes* llama.cpp, rather than just "rebranding" it.

Portable LLMs with llamafile

Posted May 15, 2024 14:47 UTC (Wed) by daroc (editor, #160859) [Link]

I apologize if I gave the impression in my article that llamafile is only a rebranded llama.cpp — llamafile has a bunch of additional code (under a different license, even) that wraps llama.cpp. See this part of the source. The project has both a copy of llama.cpp which sends patches upstream, and support code which is used to produce the final binaries.

Portable LLMs with llamafile

Posted May 15, 2024 17:03 UTC (Wed) by niner (subscriber, #26151) [Link]

What about OpenWRT? LEDE was forked off OpenWRT in 2016 and in 2018 the two projects merged again.

Or earlier, GCC was forked into EGCS. Later that fork became so successful, that the remaining GCC developers decided to join the effort and EGCS was officially adopted as the new GCC.

Such things _do_ happen.

Portable LLMs with llamafile

Posted May 14, 2024 18:36 UTC (Tue) by ballombe (subscriber, #9523) [Link]

It seems to me this project is mostly about providing users binaries for llama.cpp.
Providing users binaries can seriously distract a project from its goal, so it can be a net benefit if they are provided by someone else.

Portable LLMs with llamafile

Posted May 14, 2024 18:23 UTC (Tue) by Heretic_Blacksheep (subscriber, #169992) [Link]

"while avoiding any assumptions" about sharing code... Aspirational, but not realistic. Ultimately, the assumptions have to be based on reality. In reality both copyright and regulatory rules matter. This is why the scope of the information the statistical model mirrors must be finite and therefore auditable. "Garbage in; garbage out" is not automagically optimized away by AI. Quite the opposite. I also don't buy that I should trust C (or C++) with a modeling generator meant to operate on infinite input. PLT (and history) shows this is a recipe for disastrous machine states.

Portable LLMs with llamafile

Posted May 15, 2024 1:36 UTC (Wed) by NightMonkey (subscriber, #23051) [Link]

Just want to mention this tool that I use in my local experiments with LLMs: https://github.com/simonw/llm, enhanced with this plugin: https://github.com/simonw/llm-gpt4all to be able to easily install and run open models locally.

Portable LLMs with llamafile

Posted May 15, 2024 7:55 UTC (Wed) by taladar (subscriber, #68407) [Link] (2 responses)

My impression so far is that LLMs and other generative AI are mainly made complicated by the fact that it all involves Python in some way and the AI ecosystem seems to have copied the messiness of Python distribution while also lagging a few versions behind. If tooling could be improved in that respect, ideally by completely getting rid of any Python requirements and any GPU specific libraries (as in "this is the pytorch for ROCm, this is the one for CUDA, this is the one for CPU") it would make it a lot more accessible.

Portable LLMs with llamafile

Posted May 15, 2024 16:55 UTC (Wed) by ben.alkov (guest, #171521) [Link]

Specifically addressing Torch - their recently-available binary distro downloadable from PyPi *does* bundle everything, so that there's no need to try to figure out which one you need. Of course, the download is significantly larger than the individual platform packages.

Portable LLMs with llamafile

Posted May 15, 2024 18:18 UTC (Wed) by geofft (subscriber, #59789) [Link]

I mean, there was/is a thing called just Torch, but PyTorch, a set of Python bindings to it, got popular because you do want some sort of REPL / interpreted language to work with it and Python is a pretty good one, especially for the problem domain of scientific computing + interfacing with data sets. (I think OG Torch came with Lua, which is a little easier to embed into a C codebase but much less popular for scientific computing.)

Would it be helpful if PyTorch were additionally available as a standalone application with an embedded Python interpreter (probably an embedded Jupyter or something), so it was one thing to download and install and was independent of any other Python version/environment you might have on your system?

Portable LLMs with llamafile

Posted May 15, 2024 8:13 UTC (Wed) by taladar (subscriber, #68407) [Link] (2 responses)

GPU support does not seem to work for me on AMD. It seems to produce an assertion failure in hip on Gentoo

llava-v1.5-7b-q4.llamafile: /var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/os/os_posix.cpp:310: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.

error: Uncaught SIGABRT (SI_TKILL) at 0x3e8001df83d on <hostname removed>
./llava-v1.5-7b-q4.llamafile
File exists

Portable LLMs with llamafile

Posted May 15, 2024 14:26 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

Does *anything* work in ROCm? My impression of it over the past 5-10 years, from the volume of complaints on the internet about it, has been that it's the best advertising campaign Nvidia could've wished for.

Portable LLMs with llamafile

Posted May 18, 2024 9:42 UTC (Sat) by Felix (subscriber, #36445) [Link]

At least on my system (Fedora 40), I can run simple models like llama3-8b using llamafile+rocm, and I see a pretty decent speedup when using the GPU. I'm using the rocm packages provided by Fedora, so I think the situation is not that bad, even though there are a lot of things that are not great (e.g. support for more GPUs, more AMD work on distro integration, ...).

runs everywhere, only in english?

Posted May 15, 2024 19:47 UTC (Wed) by Herve5 (subscriber, #115399) [Link]

Do I understand correctly that all the proposed packages are English-based as concerns the language?

Portable LLMs with llamafile

Posted May 16, 2024 10:00 UTC (Thu) by yaap (subscriber, #71398) [Link]

> The difference is attributable to the fact that during prompt evaluation, the model can use matrix-matrix multiplications, instead of matrix-vector multiplications.

As I understand it, it's not the only factor. The evaluation processes the whole input buffer (so all the input tokens) in a single pass of the LLM encoder, while generation is iterative: there is a pass through the LLM decoder for each newly generated token.

