Choosing a GPU for local AI comes down to one number: VRAM. It determines how large a model you can run, and whether inference stays fast on the GPU or crawls once it spills into system RAM. Here’s the full breakdown.
Why VRAM matters more than compute
When a model is loaded, it lives in VRAM. If your model doesn’t fit, it spills into system RAM — which is 10–50× slower for inference. VRAM is the bottleneck.
Quick reference:
- 7B model (Q4): ~4 GB VRAM
- 13B model (Q4): ~8 GB VRAM
- 34B model (Q4): ~20 GB VRAM
- 70B model (Q4): ~40 GB VRAM
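If you want to sanity-check these numbers for other sizes or quant levels, the back-of-envelope math is parameters × bits-per-weight ÷ 8, plus some headroom for the KV cache and runtime buffers. A minimal sketch, assuming a flat 20% overhead factor for illustration; real usage shifts with context length and quant format:

```python
# Rough VRAM estimate: weight bytes = params * bits / 8, plus headroom
# for KV cache, activations, and runtime buffers. The 1.2x overhead
# factor is an assumed ballpark, not a measured constant.

def estimate_vram_gb(params_billions: float, bits_per_weight: float = 4,
                     overhead: float = 1.2) -> float:
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9  # decimal GB

for size in (7, 13, 34, 70):
    print(f"{size}B at Q4: ~{estimate_vram_gb(size):.0f} GB")
```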
The tiers
Entry: 8–12 GB VRAM
RTX 4070 (12 GB), RTX 3080 (10 GB). Fine for 7B models, tight for 13B. Good starting point if you already own one.
Sweet spot: 16–24 GB VRAM
RTX 4060 Ti 16 GB — the best new card at this tier. 16 GB handles 13B comfortably and can stretch to 34B at Q4 with partial CPU offload (the weights alone are ~20 GB).
RTX 3090 (used) — 24 GB at $500–$700 on the used market. The best value in local AI right now. Runs 34B models cleanly.
RTX 4090 — 24 GB, with memory bandwidth only modestly higher than the 3090 but far more compute, which mainly speeds up prompt processing. The fastest consumer inference card. ~$1,800 new.
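A useful back-of-envelope for why bandwidth matters: at batch size 1 the GPU reads every weight once per generated token, so the decode-speed ceiling is roughly memory bandwidth divided by the model's footprint in VRAM. A sketch using published bandwidth specs (real throughput lands below the ceiling):

```python
# Decode-speed ceiling for batch-1 inference: each generated token
# requires reading every weight once, so the upper bound is
# memory bandwidth / model footprint. Real throughput is lower.

def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

cards = {"RTX 3090": 936, "RTX 4090": 1008}  # published GB/s specs
for name, bw in cards.items():
    # ~20 GB footprint for a 34B model at Q4 (see the list above)
    print(f"{name}: ~{decode_ceiling_tok_s(bw, 20):.0f} tok/s ceiling on a 34B Q4")
```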
Pro tier: 48+ GB
Dual RTX 3090 (48 GB), RTX 6000 Ada (48 GB), or Quadro/Tesla cards. Needed to keep 70B at Q4 and above entirely on the GPU, or for large MoE models; full precision for a 70B is ~140 GB and out of reach even here.
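When sizing a multi-GPU build, it's worth confirming how much VRAM your software actually sees across the cards. A minimal check, assuming a CUDA (or ROCm) build of PyTorch is installed:

```python
# Sum the VRAM PyTorch can see across all GPUs. Assumes a CUDA or
# ROCm build of torch; a CPU-only install reports no devices.
import torch

if torch.cuda.is_available():
    total_gib = 0.0
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        gib = props.total_memory / 1024**3
        total_gib += gib
        print(f"GPU {i}: {props.name}, {gib:.1f} GiB")
    print(f"Total VRAM: {total_gib:.1f} GiB")
else:
    print("No GPU visible to PyTorch.")
```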
NVIDIA vs AMD
NVIDIA wins for local AI thanks to CUDA's maturity: Ollama, llama.cpp, ComfyUI, and every other major framework ship first-class CUDA support.
AMD is viable on Linux with ROCm — the RX 7900 XTX (24 GB) is a solid alternative and often cheaper. Windows ROCm support is improving but not yet at parity.
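If you're unsure which backend your PyTorch build targets, note that ROCm builds reuse the torch.cuda API, so a quick way to tell them apart is to check the build's version strings. A small sketch:

```python
# Tell a CUDA (NVIDIA) build of PyTorch apart from a ROCm/HIP (AMD) one.
# ROCm builds expose the same torch.cuda API, so check torch.version.hip.
import torch

if torch.version.hip is not None:
    print(f"ROCm/HIP build: {torch.version.hip}")
elif torch.version.cuda is not None:
    print(f"CUDA build: {torch.version.cuda}")
else:
    print("CPU-only build.")

print("GPU visible:", torch.cuda.is_available())
```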
My recommendation
Most people: Get a used RTX 3090 for ~$600. 24 GB covers everything up to 34B models; 70B at Q4_K_M (~40 GB) still needs partial CPU offload. CUDA Just Works.
Upgrading from a small card: RTX 4060 Ti 16 GB is the best new-market value at ~$450.
Serious setup: RTX 4090 or dual 3090s if you’re running 70B+ regularly.
See the full GPU picks, with buy links →