VRAM is king for local AI. The more you have, the larger the model you can run without resorting to aggressive quantization. Here’s what’s worth buying right now.
Budget Pick — NVIDIA RTX 4060 Ti (16 GB)
16 GB of VRAM at a sub-$500 price point. Runs 7B models at FP16 or 8-bit and 13B models comfortably at Q4/Q5; 34B is only practical if you offload some layers to the CPU (see the quick estimate below for the arithmetic). Great for getting started without breaking the bank.
NVIDIA GeForce RTX 4060 Ti 16GB
16 GB GDDR6 · 165W TDP · PCIe 4.0 x8 · Ada Lovelace architecture
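A quick way to sanity-check the model-size claims throughout this guide: the weights alone take roughly parameter count × bits per weight ÷ 8 bytes, plus a gigabyte or two for the KV cache and runtime overhead. Here’s a rough Python estimator; the 4.5 bits/weight figure for Q4-style quants is an approximation, not an exact value.

```python
# Back-of-the-envelope VRAM estimate: weights only.
# Add roughly 1-2 GB on top for the KV cache and runtime overhead,
# more if you run long contexts.

def estimate_weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate gigabytes needed just to hold the model weights."""
    return params_billions * bits_per_weight / 8

# Ballpark figures for common size/quantization combinations.
for label, params, bits in [
    ("7B  @ FP16", 7, 16),
    ("13B @ Q4",   13, 4.5),   # ~4.5 bits/weight for Q4_K_M-style quants
    ("34B @ Q4",   34, 4.5),
    ("70B @ Q4",   70, 4.5),
]:
    print(f"{label}: ~{estimate_weight_vram_gb(params, bits):.0f} GB of weights")
```

That arithmetic is why 16 GB covers 13B at Q4 with room to spare but falls short of 34B, and why even 24 GB stops short of 70B at Q4.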
Sweet Spot — NVIDIA RTX 3090 (24 GB)
Used RTX 3090s are the best value in local AI right now. 24 GB of VRAM handles 34B models comfortably at Q4, and can squeeze in 70B models with aggressive 2-bit quantization or by offloading some layers to the CPU. Available used for $500–$700.
Powerhouse — NVIDIA RTX 4090 (24 GB)
The fastest consumer GPU available. 24 GB of GDDR6X with roughly 1 TB/s of memory bandwidth. Prompt processing is substantially faster than on the 3090, though token-generation gains are more modest, since generation is bound by memory bandwidth and the two cards are close there (~1.0 TB/s vs. ~0.94 TB/s). If you’re serious about local AI and have the budget, this is the card.
NVIDIA GeForce RTX 4090
24 GB GDDR6X · 450W TDP · Ada Lovelace · Fastest consumer inference card
Multi-GPU — 2× RTX 3090 or 2× RTX 4090
For 70B-class models at usable quantization levels (Q4/Q5), a single 24 GB card isn’t enough, so you need multi-GPU. llama.cpp and Ollama can both split a model’s layers across multiple cards. Two RTX 3090s give you 48 GB of pooled VRAM for under $1,400; a minimal setup sketch follows below.
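Here’s a minimal sketch of what that split looks like with llama-cpp-python (the Python bindings for llama.cpp), assuming two GPUs and a Q4 70B GGUF file; the model path below is a placeholder for whatever you actually downloaded.

```python
# Minimal two-GPU split with llama-cpp-python (pip install llama-cpp-python,
# built with GPU support). Paths and split ratios below are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-70b.Q4_K_M.gguf",  # hypothetical path; use your own GGUF
    n_gpu_layers=-1,           # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],   # split the weights evenly across two cards
    n_ctx=4096,                # context length; longer contexts need more VRAM
)

out = llm("Q: Why buy two 3090s instead of one 4090? A:", max_tokens=64)
print(out["choices"][0]["text"])
```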
AMD Option — Radeon RX 7900 XTX (24 GB)
ROCm support has matured significantly. The 7900 XTX is a viable AMD alternative, especially on Linux, where ROCm works natively with Ollama and llama.cpp (a quick sanity check follows the spec card below).
AMD Radeon RX 7900 XTX
24 GB GDDR6 · 355W TDP · ROCm support on Linux · Strong competitor
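If you go the AMD route, it’s worth confirming that your tooling can actually see the card over ROCm before troubleshooting anything else. A quick sanity check with a ROCm build of PyTorch (the ROCm wheel, not the default CUDA one) looks something like this:

```python
# Quick check that a ROCm build of PyTorch can see the GPU.
# ROCm builds reuse the torch.cuda namespace, so the calls are unchanged.
import torch

print("HIP/ROCm version:", torch.version.hip)        # None on a CUDA-only build
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))  # should report the RX 7900 XTX
```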
Affiliate disclosure: Links use the selfhostailab-20 Amazon tag and Newegg affiliate program. Prices vary — always check current listings.