Things move at lightning speed in AI Land. On Friday, a software developer named Georgi Gerganov created a tool called “llama.cpp” that can run Meta’s new GPT-3-class AI large language model, LLaMA, locally on a Mac laptop. Soon after, people worked out how to run LLaMA on Windows as well. Then someone showed it running on a Pixel 6 phone. Next came a Raspberry Pi (albeit running very slowly).
If this continues, we could be looking at a pocket-sized ChatGPT competitor before we know it.
But let’s back up a minute, because we’re not there yet. (At least not today — as in, literally today, March 13, 2023.) But no one knows what next week will bring.
Since ChatGPT was launched, some people have been frustrated by the AI model’s built-in limits that prevent it from discussing topics that OpenAI considers sensitive. Thus began the dream—in some quarters—of an open source large language model (LLM) that anyone could run locally without censorship and without paying API fees to OpenAI.
There are open source alternatives (like GPT-J), but they require a lot of GPU RAM and storage space, and other open source models could not boast GPT-3-level performance on readily available consumer-grade hardware.
Enter LLaMA, an LLM available in parameter sizes ranging from 7B to 65B (that’s “B” as in “billions of parameters,” which are floating-point numbers stored in arrays that represent what the model “knows”). LLaMA made a heady claim: that its smaller models could match OpenAI’s GPT-3, the foundational model that powers ChatGPT, in the quality and speed of their output. There was just one problem—Meta released the LLaMA code as open source but held back the “weights” (the trained “knowledge” stored in a neural network) for qualified researchers only.
Flying at the speed of LLaMA
Meta’s restrictions on LLaMA didn’t last long, because on March 2, someone leaked the LLaMA weights on BitTorrent. Since then, there has been an explosion of development around LLaMA. Independent AI researcher Simon Willison compared this situation to the release of Stable Diffusion, an open source image synthesis model launched last August. As he wrote in a blog post:
It seems to me that that Stable Diffusion moment back in August kick-started the entire new wave of interest in generative AI – which was then pushed into overdrive by the launch of ChatGPT in late November.
This Stable Diffusion moment is happening again right now, for large language models—the technology behind ChatGPT itself. This morning I ran a GPT-3 class language model on my own personal laptop for the first time!
The AI stuff was already weird. It’s about to get a lot weirder.
Typically, running GPT-3 requires several data center-class A100 GPUs (also, the weights for GPT-3 are not public), but LLaMA made waves because it could run on a single consumer GPU. And now, with optimizations that reduce the model size using a technique called quantization, LLaMA can run on an M1 Mac or a smaller Nvidia consumer GPU.
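To make the memory math concrete: a 7B-parameter model stored as 16-bit floats occupies roughly 14 GB, while the same weights squeezed down to 4 bits take roughly 3.5 GB plus a little overhead, which is what puts it within reach of a laptop. Below is a minimal, hypothetical Python/NumPy sketch of the general block-quantization idea; the block size, scaling scheme, and numbers are illustrative assumptions, not llama.cpp’s actual GGML format.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Toy 4-bit quantizer: map each block of floats to ints in [-7, 7] plus one scale."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12
    quants = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return quants, scales.astype(np.float32)

def dequantize(quants: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from the quantized blocks."""
    return (quants * scales).astype(np.float32).reshape(-1)

# Stand-in for a slice of real model weights.
weights = np.random.randn(4096).astype(np.float32)
quants, scales = quantize_4bit(weights)
restored = dequantize(quants, scales)
print("max reconstruction error:", float(np.abs(weights - restored).max()))

# Rough storage math for a 7B-parameter model:
#   16-bit floats: 7e9 params * 2 bytes   ~= 14 GB
#   4-bit ints:    7e9 params * 0.5 bytes ~= 3.5 GB (plus small per-block scale overhead)
```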
Things are moving so fast that sometimes it’s hard to keep up with the latest developments. (Regarding the pace of AI progress, a fellow AI reporter told Ars, “It’s like those videos of dogs where you dump a crate of tennis balls on them. [They] don’t know where to chase first and get lost in the confusion.”)
For example, here’s a list of notable events related to LLaMA, based on a timeline Willison presented in a Hacker News comment:
- February 24, 2023: Meta AI announces LLaMA.
- March 2, 2023: Someone leaks the LLaMA models via BitTorrent.
- March 10, 2023: Georgi Gerganov creates llama.cpp, which can run on M1 Macs.
- March 11, 2023: Artem Andreenko runs LLaMA 7B (slowly) on a Raspberry Pi 4 with 4 GB of RAM, at 10 seconds per token.
- March 12, 2023: LLaMA 7B runs via npx, a Node.js package-execution tool.
- March 13, 2023: Someone gets llama.cpp running on a Pixel 6 phone, also very slowly.
- March 13, 2023: Stanford releases Alpaca 7B, an instruction-tuned version of LLaMA 7B that “behaves similarly to OpenAI’s text-davinci-003” but runs on much less powerful hardware.
After obtaining the LLaMA weights ourselves, we followed Willison’s instructions and got the 7B parameter version running on an M1 MacBook Air at a reasonable speed. You call it as a script on the command line with a prompt, and LLaMA does its best to complete it in a logical way.
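For a sense of what that looks like in practice, here is a small hypothetical sketch that wraps a local llama.cpp binary in a Python script; the binary name, model path, and flags are assumptions based on early llama.cpp builds and may differ in your checkout.

```python
import subprocess

# Assumed names and paths for an early llama.cpp build; adjust for your own setup.
LLAMA_BIN = "./main"                           # compiled llama.cpp binary
MODEL = "./models/7B/ggml-model-q4_0.bin"      # 4-bit quantized 7B weights

def complete(prompt: str, n_tokens: int = 128, threads: int = 4) -> str:
    """Feed one prompt to the local model and return whatever text it prints."""
    result = subprocess.run(
        [LLAMA_BIN, "-m", MODEL, "-p", prompt, "-n", str(n_tokens), "-t", str(threads)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(complete("The first man on the moon was"))
```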

There is still the question of how much quantization affects output quality. In our tests, LLaMA 7B truncated to 4-bit quantization was very impressive for running on a MacBook Air—but it still wasn’t quite up to par with what you’d expect from ChatGPT. It’s entirely possible that better prompting techniques could produce better results.
Also, optimizations and fine-tunings come quickly when everyone has their hands on the code and the weights — although LLaMA still comes with some fairly restrictive terms of use. Stanford’s release of Alpaca today shows that fine-tuning (additional training with a specific goal in mind) can improve performance, and it’s still early days after LLaMA’s launch.
As of this writing, running LLaMA on a Mac remains a fairly technical exercise. You must install Python and Xcode and be familiar with working on the command line. Willison has good step-by-step instructions for anyone who wants to attempt it. But that may soon change as developers continue to hack away at the code.
As for the implications of having this technology out in the wild, no one knows yet. While some worry about the impact of AI as a tool for spam and disinformation, Willison says, “It’s not going to be un-invented, so I think our priority should be to find the most constructive ways to use it.”
Right now, our only guarantee is that things will change quickly.