POST 0x07 // ml // llm
2025.10.02 // 12 min read

notes on running an LLM on a potato

8GB of RAM, no GPU, vibes-only inference. quantization is a hell of a drug.

Nattapong Jaisabai
Software Engineer · published 2025.10.02

a friend bet me a beer i couldn't run a useful LLM on my 8GB thinkpad. i won the beer. the model is 'useful' in the way a rotary phone is useful: with sufficient patience, and only for certain things.

the recipe

  • llama.cpp built with -DGGML_NATIVE=ON
  • a 3B-parameter model quantized to Q4_K_M (1.8GB on disk)
  • no GPU. cpu only. arch linux. delusion.
  • patience. so much patience.
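
a back-of-envelope sketch of why the quantized file is so small. the rule is just parameters × bits-per-weight ÷ 8. the bits-per-weight figures here are assumptions, not numbers from my setup: roughly 4.8 for Q4_K_M (an approximate average, since the format mixes block types) and 8.5 for an 8-bit format with per-block scales. `quantized_size_gb` is a made-up helper name.

```python
# rough on-disk size of a quantized model: parameters * bits-per-weight / 8.
# ASSUMPTION: ~4.8 bits per weight for Q4_K_M is an approximate average,
# not a measured figure; real files also carry some metadata on top.

def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimated file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    size_bytes = n_params * bits_per_weight / 8
    return size_bytes / 1e9

if __name__ == "__main__":
    n_params = 3e9  # the 3B-parameter model from the recipe
    for label, bpw in [("f16", 16.0), ("8-bit (approx)", 8.5), ("q4_k_m (approx)", 4.8)]:
        print(f"{label:>16}: {quantized_size_gb(n_params, bpw):.1f} GB")
```

under those assumptions the 4-bit estimate lands at about 1.8 GB, while the same model at f16 would be about 6 GB, which is most of an 8GB machine before the OS gets a say.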

what it can do

  • summarize a 2000-word email into 200 words: yes, slowly, and mostly correctly.
  • write a regex that matches IPv4: yes.
  • write a regex that matches IPv6: optimistically, no.
  • autocomplete code in a small file: yes, with a 3-second latency that taught me to be honest about whether i actually needed the next token.
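
for reference, here is roughly what a passing answer to the IPv4 question looks like. this is my own sketch, not the model's output, and it assumes "matches IPv4" means strict dotted-quad: four octets in 0-255, no leading zeros. `is_ipv4` is a hypothetical helper name.

```python
import re

# one octet: 0-255, no leading zeros (so "01" is rejected)
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# four octets separated by literal dots
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

def is_ipv4(text: str) -> bool:
    """True if the whole string is a dotted-quad IPv4 address."""
    # fullmatch anchors both ends, so trailing junk or newlines fail
    return IPV4.fullmatch(text) is not None

if __name__ == "__main__":
    for candidate in ["192.168.1.1", "256.1.1.1", "1.2.3", "01.2.3.4"]:
        print(candidate, is_ipv4(candidate))
```

IPv6 is a different animal (zero compression, embedded IPv4, zone ids), which is probably why a regex is the wrong tool there; python's `ipaddress` module is the less heroic option.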

$ llama-cli -m phi3-mini-q4.gguf -p "explain monads in one sentence"
# 142 tokens generated in 18.4s (7.7 tok/s)
# A monad is a design pattern that lets you sequence computations
# while threading some hidden context through them.
#
# okay, actually that's a reasonable answer? credit where credit is due.
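
the throughput number is just tokens divided by seconds, and the same arithmetic gives a rough feel for how long a 200-word summary takes. a minimal sketch: the 1.3 tokens-per-word ratio is an assumed rule of thumb, not something i measured, and time spent reading the prompt is ignored. function names are made up for illustration.

```python
# ASSUMPTION: ~1.3 tokens per english word is a rough rule of thumb,
# not a measured value. prompt processing time is not included.
TOKENS_PER_WORD = 1.3

def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Generation throughput."""
    return n_tokens / seconds

def generation_seconds(n_words: int, tok_per_s: float) -> float:
    """Estimated time to generate n_words of output at a given rate."""
    return n_words * TOKENS_PER_WORD / tok_per_s

if __name__ == "__main__":
    rate = tokens_per_second(142, 18.4)
    print(f"measured rate: {rate:.1f} tok/s")  # 7.7 tok/s, matching the log above
    print(f"200-word summary: ~{generation_seconds(200, rate):.0f}s of generation")
```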

"the future is not here yet. the future is here but it's 8 tokens per second and uses 100% of your fan."

quantization is a hell of a drug. q4 is the floor of dignity; q3 is where it forgets what nouns are. anything below that is performance art. ten out of ten experiment. would underclock my fan again.

EOF · 0x07 · last edit 2025.10.02 // thanks for reading.