a friend bet me a beer i couldn't run a useful LLM on my 8GB thinkpad. i won the beer. the model is 'useful' in the way a rotary phone is useful: with sufficient patience, and only for certain things.
the recipe
- llama.cpp built with -DGGML_NATIVE=ON
- a 3B-parameter model quantized to Q4_K_M (1.8GB on disk)
- no GPU. cpu only. arch linux. delusion.
- patience. so much patience.
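a back-of-envelope check on that 1.8GB figure. this is a sketch, not how llama.cpp computes anything: it assumes Q4_K_M averages roughly 4.85 bits per weight (an approximate figure that shifts with the model architecture), and the function name is made up for illustration.

```python
# rough size estimate for a quantized model file.
# ASSUMPTION: Q4_K_M averages about 4.85 bits per weight. the real number
# varies by architecture, and the file also carries some metadata.
ASSUMED_BITS_PER_WEIGHT = 4.85


def estimated_size_gb(n_params: float,
                      bits_per_weight: float = ASSUMED_BITS_PER_WEIGHT) -> float:
    """Return the approximate file size in decimal gigabytes."""
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 1e9


if __name__ == "__main__":
    # 3 billion parameters, as in the recipe above
    print(f"{estimated_size_gb(3e9):.2f} GB")
```

under that assumption a 3B model comes out a hair over 1.8GB, which lines up with the size on disk.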
what it can do
- summarize a 2000-word email into 200 words: yes, slowly, and mostly correctly.
- write a regex that matches IPv4: yes.
- write a regex that matches IPv6: optimistically, no.
- autocomplete code in a small file: yes, with a 3-second latency that taught me to be honest about whether i actually needed the next token.
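for reference, a sketch of what a passing IPv4 answer can look like. this is not a transcript of what the model produced; it is one common way to write the pattern in python, and the name is_ipv4 is hypothetical.

```python
import re

# one octet: 0-255, no leading zeros
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# four octets separated by dots. re.ASCII keeps \d to the digits 0-9.
_IPV4 = re.compile(rf"(?:{_OCTET}\.){{3}}{_OCTET}", re.ASCII)


def is_ipv4(text: str) -> bool:
    """Return True if text is a dotted-quad IPv4 address and nothing else."""
    return _IPV4.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["192.168.1.1", "256.1.1.1", "1.2.3"]:
        print(candidate, is_ipv4(candidate))
```

the range check on each octet is the part that separates a real answer from four runs of digits with dots between them. IPv6 has enough shorthand forms that the equivalent pattern gets long fast, which may be why the small model gives up.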
$ llama-cli -m phi3-mini-q4.gguf -p "explain monads in one sentence"
# 142 tokens generated in 18.4s (7.7 tok/s)
# A monad is a design pattern that lets you sequence computations
# while threading some hidden context through them. okay actually
# that's a reasonable answer? credit where credit is due.
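the arithmetic behind that tok/s figure, plus a rough guess at what it means for the 200-word summary. a sketch only: the 1.3 tokens-per-word ratio is a rule-of-thumb assumption, not something measured here, and the function names are made up.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Generation throughput: tokens produced divided by wall-clock time."""
    return tokens / seconds


def seconds_for(tokens: int, rate: float) -> float:
    """Time to generate a given number of tokens at a given rate."""
    return tokens / rate


# ASSUMPTION: about 1.3 tokens per English word. varies by tokenizer and text.
ASSUMED_TOKENS_PER_WORD = 1.3

if __name__ == "__main__":
    rate = tokens_per_second(142, 18.4)  # numbers from the run above
    summary_tokens = round(200 * ASSUMED_TOKENS_PER_WORD)
    print(f"{rate:.1f} tok/s")
    print(f"{seconds_for(summary_tokens, rate):.0f}s to generate ~200 words")
```

that covers generation only. reading the 2000-word email in the first place costs extra time on top, which is where the patience comes in.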
"the future is not here yet. the future is here but it's 8 tokens per second and uses 100% of your fan."
quantization is a hell of a drug. q4 is the floor of dignity; q3 is where it forgets what nouns are. anything below that is performance art. ten out of ten experiment. would underclock my fan again.