Portable LLMs with llamafile
Large language models (LLMs) have been the subject of much discussion and scrutiny recently. Of particular interest to open-source enthusiasts are the problems with running LLMs on one's own hardware — especially when doing so requires NVIDIA's proprietary CUDA toolkit, which remains unavailable in many environments. Mozilla has developed llamafile as a potential solution to these problems. Llamafile can compile LLM weights into portable, native executables for easy integration, archival, or distribution. These executables can take advantage of supported GPUs when present, but do not require them.
Portability
Llamafile is based on the MIT-licensed llama.cpp library, a C/C++ implementation of the code necessary to evaluate LLMs of several different architectures. Llamafile maintains its own copy of llama.cpp, with some additional MIT-licensed changes. The code for llamafile itself is made available under the Apache 2.0 license. Llamafile's value comes from building llama.cpp in a way that works seamlessly across many different environments. On April 25, Mozilla posted an update about llamafile's progress that called it "the easiest and fastest way to run a wide range of open large language models".
The lead developer of llamafile, Justine Tunney, chose to re-use her previous work on Cosmopolitan Libc. That project implements a C standard library and associated compiler infrastructure to compile programs written in C as multi-format executables that can run on Linux, macOS, Windows, several BSDs, or without an operating system.
In the blog post announcing the start of Cosmopolitan Libc, Tunney said:
I like the idea of having the freedom to write software without restrictions that transcends traditional boundaries. My goal has been helping C become a build-once run-anywhere language, suitable for greenfield development, while avoiding any assumptions that would prevent software from being shared between tech communities.
In service of that goal, Cosmopolitan Libc uses clever techniques to create executable files that can be simultaneously interpreted as several different formats. The executables it produces start with a shell script that behaves differently on different operating systems; the program exploits this to select a pre-compiled static binary suitable for the operating system and architecture at run time. These files (along with the weights of the LLM, in llamafile's case) are bundled into a zip file. Since zip files store their metadata at the end of the file, the same file can serve as shell script, executable, and zip archive.
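The trick can be seen directly with ordinary tools. As a quick illustration (using the example model discussed later in this article, and assuming the --help flag, which current llamafiles accept):

    # Run the file as a shell script; the stub selects the right embedded binary:
    $ sh ./Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile --help
    # The very same file is also a valid zip archive containing the model weights:
    $ unzip -l Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile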
This approach is of questionable use for most programs, but for Mozilla, it represents an important way to democratize access to LLMs. There are an increasing number of repositories, most prominently Hugging Face, that distribute raw LLM weights — but raw weights aren't enough to use a model. Users also need inference code, and software that provides an API, in order to access the results of the model. To make things worse, that inference code is often specific to a particular brand of GPU or machine-learning toolchain, which makes LLMs hard to run except in specific environments.
Llamafile applies Cosmopolitan Libc's philosophy of selecting an appropriate implementation at run time to machine learning. It automatically detects whether the user has AMD's or NVIDIA's GPU toolchain available and, if so, uses it. If not, it uses a new open-source linear-algebra library called tinyBLAS, which uses the APIs made available by existing graphics drivers to take advantage of GPU acceleration without requiring an installed toolchain. This is slower than letting NVIDIA's CUDA or AMD's ROCm compile a native program for a specific model of graphics card, but it is still useful for users who have the hardware but not the GPU SDKs.
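Users can also steer this selection from the command line. A brief sketch (the --gpu and -ngl options are documented by llamafile, though defaults may vary between releases, and the file name here is a placeholder):

    # Offload as many model layers as possible to a detected GPU:
    $ ./model.llamafile -ngl 999
    # Force CPU-only inference even when a GPU is present:
    $ ./model.llamafile --gpu disable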
TinyBLAS doesn't work with all drivers, however. If no GPU acceleration is available, llamafile falls back to CPU implementations of the core linear algebra libraries — versions that are specialized for particular microarchitectures and specific hardware. On March 31, Tunney published a detailed blog post discussing how she improved CPU inference performance across a wide variety of hardware, often by hand-writing a matrix multiplication kernel tuned for that exact hardware.
There's another trick that llamafile uses to speed up matrix multiplication, though, which is much more specific to its purpose as a platform for running LLMs. Generic linear algebra libraries like BLAS need to be able to multiply arbitrary matrices with unknown dimensions, possibly transposed or weighted in some way. LLM inference, because it proceeds one token at a time, spends a lot of time doing matrix-vector multiplications that can be written in a simpler form.
Even when LLMs do generalized matrix multiplications (during prompt processing), the models are architected such that the matrices are usually of a known size — often a multiple of 64. This lets a hand-unrolled implementation specific to those sizes outperform a more generic algorithm. Tunney benchmarked the multiplication of a 513x512 matrix with a 512x512 one (a size llamafile uses frequently), finding that her code outperformed Intel's proprietary Math Kernel Library (MKL) on that specific size; MKL is still faster on other sizes. Since llamafile controls the size of the batches used during prompt processing, however, that's still a clear performance improvement.
Using llamafile
Using an LLM packaged by llamafile is fairly straightforward. The project's README links to several examples of different sizes. Downloading a file, marking it as executable, and running it is all that should be required in the vast majority of cases. Users who have binfmt_misc registrations for WINE might need to add a more specific rule to prevent WINE from being used as the program's interpreter. Running the program with no arguments opens llama.cpp's simple chat interface in the browser.
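A minimal session might look like the following, with the download URL elided (the real links are in the project's README). The binfmt_misc rule is the one the Cosmopolitan documentation suggests for the WINE conflict; the path to the ape loader is system-dependent:

    # Fetch an example model, make it executable, and run it:
    $ wget https://example.com/path/to/Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile
    $ chmod +x Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile
    $ ./Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile

    # If WINE's binfmt_misc rule intercepts the file, register a more specific
    # rule for Actually Portable Executables:
    $ sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' > /proc/sys/fs/binfmt_misc/register"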
The built-in web server also offers an OpenAI-compatible API, so tools that expect to talk to the proprietary service can be seamlessly redirected, as can tools that use OpenAI's API design as a de facto standard for LLM inference. Users who are more comfortable on the command line can pass parameters and instructions as arguments instead.
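As an illustration, assuming a llamafile server running on its default port (8080 at the time of writing), a request to the OpenAI-style chat-completions endpoint that llama.cpp's server implements could look like:

    $ curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "local",
             "messages": [{"role": "user", "content": "Say hello."}]}'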
Parameters can also be baked into a llamafile executable. As mentioned above, the files are actually valid zip files; adding a file named .args (with one argument per line) to the executable will make it treat those arguments as additional command-line parameters. The procedure for turning the llamafile binary produced by building the project into an LLM-specific llamafile for distribution is actually the same: add the weights and any required arguments to the zip file.
For performance reasons, however, it's important to add the weights without compression, and ideally aligned to a 4K boundary. This allows llamafile to map the weights directly into memory, which is substantially faster than decompressing them into non-disk-backed memory. For this purpose, the project also provides a utility called zipalign that adds files to a zip archive in the correct way.
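A sketch of that packaging step, using hypothetical file names and the -j0 flags described in the project's documentation (which may change between releases; zipalign is built along with the project):

    # .args holds one argument per line; these become the program's defaults:
    $ cat > .args <<'EOF'
    -m
    mymodel.Q5_K_M.gguf
    --host
    0.0.0.0
    EOF
    # Store the files uncompressed (-0) with paths stripped (-j), aligned so
    # that the weights can be mapped directly into memory:
    $ zipalign -j0 mymodel.llamafile mymodel.Q5_K_M.gguf .args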
On my laptop, which lacks any relevant GPUs but does have a spiffy 12th-generation Intel i7 processor, the Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile download provided as an example evaluates prompts at a rate of about 16 tokens per second. The answer itself is generated at about 3.5 tokens per second. The difference is attributable to the fact that, during prompt evaluation, the model can use matrix-matrix multiplications instead of matrix-vector multiplications. But this level of performance — while perhaps too slow to process many large documents — seems entirely adequate for local use.
With LLMs becoming increasingly integrated into other software, efforts to make them easy to run on existing, consumer hardware are an important part of making sure users can benefit from the technology without sending their data to third parties. The ultimate test of whether a format is suitable for widespread use is whether it is actually adopted; with llamafile being only a few months old, it's too soon to say for sure whether the project has achieved its goals. It does, however, seem to be well on the way.
Posted May 15, 2024 7:07 UTC (Wed)
by rsidd (subscriber, #2582)
[Link] (4 responses)
This project, as I understand it, is basically about (a) building llama.cpp with cosmocc (for improved portability: you can literally run the same output file on linux, windows, macos); (b) speeding up linear algebra (which she is feeding upstream where relevant) with the goal of (c) bundling LLM weights and llama.cpp into single, portable, fast executables that can use GPU when possible, or run with adequate speed on CPU, without user intervention or configuration.
Posted May 15, 2024 14:47 UTC (Wed)
by daroc (editor, #160859)
[Link]
I apologize if I gave the impression in my article that llamafile is only a rebranded llama.cpp — llamafile has a bunch of additional code (under a different license, even) that wraps llama.cpp. See this part of the source. The project has both a copy of llama.cpp which sends patches upstream, and support code which is used to produce the final binaries.
Posted May 15, 2024 17:03 UTC (Wed)
by niner (subscriber, #26151)
[Link]
Or earlier, GCC was forked into EGCS. Later, that fork became so successful that the remaining GCC developers decided to join the effort, and EGCS was officially adopted as the new GCC.
Such things _do_ happen.
Posted May 15, 2024 18:18 UTC (Wed)
by geofft (subscriber, #59789)
[Link]
Would it be helpful if PyTorch were additionally available as a standalone application with an embedded Python interpreter (probably an embedded Jupyter or something) so it was one thing to download and install, and it was independent of any other Python version/environment you might have on your system?
Posted May 15, 2024 8:13 UTC (Wed)
by taladar (subscriber, #68407)
[Link] (2 responses)
llava-v1.5-7b-q4.llamafile: /var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/rocclr/os/os_posix.cpp:310: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.
error: Uncaught SIGABRT (SI_TKILL) at 0x3e8001df83d on <hostname removed>
Posted May 16, 2024 10:00 UTC (Thu)
by yaap (subscriber, #71398)
[Link]
As I understand it, it's not the only factor. The evaluation processes the whole input buffer (so all the input tokens) in a single pass of the LLM encoder, while the generation is iterative: there will be a pass through the LLM decoder for each newly generated token.
Portable LLMs with llamafile
She is sending patches upstream. The whole point of open source is you can fork it for your own interests, which may not match upstream. But if upstream is interested, the fork can re-merge.
Portable LLMs with llamafile
Providing users binaries can seriously distract a project from its goal, so it can be a net benefit if they are provided by someone else.
Portable LLMs with llamafile
./llava-v1.5-7b-q4.llamafile
File exists
runs everywhere, only in english?