High-VRAM GPUs aren't the future of local AI — unified memory and Mixture of…

If you’ve spent any time around local AI, you’ve absorbed the same rule of thumb everyone else has: more VRAM is better, and a discrete GPU stuffed with it is the dream. It’s not bad advice, as any graphics card with fast memory will chew through any model small enough to fit inside it. For the last few years, the path to running bigger models locally was simply buying a card with more VRAM.

That path has quietly hit a wall in recent years. Consumer VRAM has stalled, with the RTX 5090 sitting at the top end with 32GB, while the open-weight models worth running have grown to hundreds of billions of parameters in some instances. A 32GB card can’t even load most of the largest, most capable open-weight models these days. Hope isn’t lost for those models, though, as the interesting work in local AI has moved to a different kind of machine entirely: unified-memory systems running mixture of experts (MoE) models. The combination lets a comparatively slow box hold and usefully run models that a 5090 has no hope of touching.

It’s not all upsides when using these unified memory machines, as in most cases, the bandwidth is comparatively mediocre and prompt processing on long inputs can be slower. However, for the specific job of running the biggest models you can get your hands on, it works very well, and nothing in the consumer GPU world comes close.

You see, there are two separate phases that run every time a model generates an output. The first is prefill, when the model reads your prompt. It tends to be compute-bound because the model can process every prompt token in parallel: conceptually, it’s doing enormous matrix operations over the whole input rather than generating one token at a time. The second is decode, when the model writes its reply one token at a time, and it’s memory-bandwidth-bound. Every token re-reads the model’s weights out of memory, so generation speed is roughly the bandwidth divided by how many bytes each token has to read. Fewer weights read per token means faster generation.
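The decode rule of thumb above can be written as a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the 4-bit (0.5 bytes per parameter) example are assumptions for illustration, and the result is an upper bound that ignores KV-cache reads, compute time, and software overhead.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             params_read_billions: float,
                             bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling: memory bandwidth / bytes read per token."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = params_read_billions * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Illustrative only: a 30B-parameter dense model with 4-bit weights
# on a machine with 800 GB/s of memory bandwidth.
ceiling = decode_tokens_per_second(800, 30, 0.5)
print(f"ceiling: about {ceiling:.0f} tokens per second")
```

Halving the number of parameters read per token doubles the ceiling, which is the whole reason active parameter count matters so much later on.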

That split is incredibly important when assessing your VRAM requirements, as it’s not just a question of “How much?” but also “How fast?” You need capacity to hold the model, sure, but generating from it quickly needs bandwidth. A discrete GPU gives you pretty fast bandwidth but caps capacity quite hard, whereas a unified memory machine with more than 32GB of RAM isn’t exactly uncommon.

Unified memory works like this: instead of a small pool of very fast memory bolted to a GPU, the CPU and GPU share one big coherent pool, with no copying back and forth across a bus. There are three major players doing this right now, and they all have, roughly, the same result. Apple Silicon goes furthest on both ends, with the M3 Ultra Mac Studio reaching 512GB at around 800 GB/s. Nvidia’s GB10, the chip inside the DGX Spark and Lenovo’s ThinkStation PGX, gives you 128GB at 273 GB/s with the entire CUDA stack behind it. AMD’s Strix Halo, the Ryzen AI Max+ 395 in machines like the Framework Desktop, offers 128GB at roughly 256 GB/s for the lowest cost of the three.

None of those bandwidth figures are impressive when compared with a discrete GPU that can clear 1,000 GB/s easily. So, on pure decode speed for a model that fits in both, the GPU wins comfortably. What unified memory buys you is the capacity to hold models that have nowhere to live on a 32GB card. For example, when we tested the Mac Studio, we were able to run the full 671B DeepSeek R1 at Q4 on it, generating at approximately 15 to 20 tokens per second. That’s a 400GB model that a 5090 can’t load under any circumstances.
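To see why capacity is the deciding factor here, a rough footprint estimate helps. The sketch below assumes exactly 4 bits per weight, which is a simplification: real Q4 formats store extra scaling data alongside the weights, so actual files come out larger than this lower bound (in line with the roughly 400GB figure above). The function names are hypothetical.

```python
def weights_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return total_params_billions * bits_per_weight / 8


def fits(total_params_billions: float, bits_per_weight: float,
         memory_gb: float) -> bool:
    """True if the weights alone fit; the KV cache and OS need room on top."""
    return weights_gb(total_params_billions, bits_per_weight) <= memory_gb


size = weights_gb(671, 4)                 # lower bound for a 671B model at 4 bits
print(f"weights: at least {size:.1f} GB")
print("fits in 32 GB: ", fits(671, 4, 32))
print("fits in 512 GB:", fits(671, 4, 512))
```

Even the optimistic lower bound is more than ten times the capacity of a 32GB card, so no amount of GPU bandwidth helps.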

In terms of cost, there’s more at play than the asking price of either the $9,500 Mac Studio or the $5,000 ThinkStation PGX. Price up the five-figure multi-GPU server you’d otherwise need to hold 400GB of weights, and all of a sudden that cost doesn’t seem so ludicrous. Whether spending several thousand dollars on a local inference machine is worth it is another question entirely, but the unified memory machines are cheaper than the multi-GPU setups you could otherwise build.

The catch, though, is bandwidth. If you put a large dense model on one of these machines, you’ll have a miserable time, because every token goes through every single weight using that constrained memory pipe. Clearly, unified memory on its own isn’t the answer, and that’s where mixture of experts comes in.

A mixture of experts model splits its feed-forward layers into many separate experts and only routes each token through a handful of them. The total parameter count, the part that has to live in memory, stays huge, but the active count, the part actually read per token, stays small. DeepSeek R1 is 671B total with only 37B active. Qwen3 Coder Next is 80B total, but with a minuscule 3B active parameters per token. Finally, Step-3.5-Flash is 196B with 11B active parameters. All of these have one thing in common: the model is enormous on paper, but behaves more like a small one while it’s generating.
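The routing idea can be shown with a toy example. This is a minimal sketch of top-k routing, not how any particular model implements it: real routers compute their scores from the token's hidden state with a learned layer, whereas here the scores are simply passed in, and the "experts" are stand-in scalar functions. All names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, top_k):
    """Pick the top_k experts and renormalize their weights to sum to 1."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


def moe_layer(x, experts, router_scores, top_k=2):
    """Weighted sum of the chosen experts; the others are never evaluated."""
    return sum(weight * experts[i](x) for i, weight in route(router_scores, top_k))


# Four toy experts that just scale their input by a different factor.
toy_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
toy_scores = [0.1, 2.0, 0.3, 1.5]   # made-up router scores for one token

print(route(toy_scores, top_k=2))
print(moe_layer(1.0, toy_experts, toy_scores, top_k=2))
```

All four experts have to exist in memory, but only two are touched for this token. That is the total-versus-active split in miniature.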

That’s where these models start to line up with unified memory. Total parameters need capacity, which these machines have in abundance. Active parameters need bandwidth, and MoE keeps the active count low enough that each token reads only a fraction of what a dense model of the same total size would. In other words, you get to keep a frontier-class brain resident in memory and generate from it at the cost of a model a fraction of the size.

In recent years, the frontier open-weight trend has gone overwhelmingly in the direction of mixture of experts models. DeepSeek-V4-Pro, released in April, is 1.6T total with just 49B active, around 3% of the model lit up per token, and it’s under an MIT license. Qwen3.5-397B-A17B activates 17B of its 397B across 512 experts and beats Alibaba’s own larger trillion-parameter model while decoding several times faster at long context. The people building the models and the people building unified-memory hardware have, without coordinating, placed the same bet.
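The "around 3%" figure is simple division, and the same arithmetic applies to every model named so far. A quick sketch, using only the total and active parameter counts quoted in this article (the list and function names are just for illustration):

```python
# (name, total parameters in billions, active parameters in billions),
# all taken from the figures quoted in this article.
MODELS = [
    ("DeepSeek R1", 671, 37),
    ("Qwen3 Coder Next", 80, 3),
    ("Step-3.5-Flash", 196, 11),
    ("DeepSeek-V4-Pro", 1600, 49),
    ("Qwen3.5-397B-A17B", 397, 17),
]


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


for name, total_b, active_b in MODELS:
    print(f"{name}: {active_fraction(total_b, active_b):.1%} active per token")
```

Every one of them comes in under 6%, which is why they generate like much smaller models.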

Say you’re setting up a local LLM for coding, and you settle on three options: Devstral 2 123B, Qwen3 Coder Next 80B, and Gemma 4 31B. Devstral 2 is the most capable model of the three on paper, but every token reads all of its parameters, and tokens per second will likely be measured in single digits. That’s borderline unusable for back-and-forth interactive coding. Qwen3 Coder Next, though, is an 80B MoE that activates only around 3B per token, so it generates several times faster on the same machine while being a bigger, generally stronger coding model than the small dense option. In my experience, it managed anywhere from 40 to 60 tokens per second on the ThinkStation PGX. Gemma 4 at 31B, meanwhile, looks like the safe, fast choice, except it’s dense too: it reads all 30.7B of its parameters per token, roughly 10x more than Qwen3 Coder Next, so it isn’t even particularly fast, and it’s the weakest of the three on top of that.
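Putting rough numbers on that comparison makes the gap obvious. The sketch below applies the bandwidth-divided-by-bytes rule to the three candidates, using the 273 GB/s GB10 figure quoted earlier and assuming roughly 4-bit weights (0.5 bytes per parameter). That quantization level is an assumption, and the results are ceilings, not predictions.

```python
from dataclasses import dataclass

BANDWIDTH_GB_S = 273.0   # GB10 figure quoted earlier in the article
BYTES_PER_PARAM = 0.5    # assumption: roughly 4-bit weights


@dataclass
class Model:
    name: str
    active_params_b: float  # billions of parameters read per token

    def decode_ceiling(self) -> float:
        """Bandwidth-bound upper limit on tokens per second."""
        return BANDWIDTH_GB_S / (self.active_params_b * BYTES_PER_PARAM)


candidates = [
    Model("Devstral 2 123B (dense)", 123.0),
    Model("Qwen3 Coder Next 80B (MoE, ~3B active)", 3.0),
    Model("Gemma 4 31B (dense)", 30.7),
]

# Fastest first.
ranked = sorted(candidates, key=Model.decode_ceiling, reverse=True)
for model in ranked:
    print(f"{model.name}: at most ~{model.decode_ceiling():.0f} tok/s")
```

Under these assumptions the ceilings land at roughly 4, 18, and 182 tokens per second for Devstral 2, Gemma 4, and Qwen3 Coder Next respectively. Real throughput sits well below the ceiling once attention, the KV cache, and runtime overhead are accounted for, which is consistent with the 40 to 60 tokens per second measured above, but the ordering doesn't change.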

In this instance, the 80B MoE model is by far the best to deploy. It’s bigger than Gemma 4, it’s significantly faster than both alternatives, and you don’t have to make a speed versus quality trade-off. It beats Gemma 4 31B in both knowledge and speed, and it beats Devstral 2 by virtue of being actually usable. Gemma 4 gets nothing for being small: it’s middling on speed and outclassed on capability.

MoE models also let you pull off some tricks to improve performance with reduced VRAM; for example, I offloaded gpt-oss-120b’s expert layers to system RAM on a 24GB VRAM card, and I still had a usable LLM with a generation speed of 20 tokens per second.
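A rough way to reason about that kind of split is to add up the time spent reading from each memory tier per token. The sketch below is purely illustrative: the per-token read sizes and bandwidths are made-up placeholder values, not measurements of gpt-oss-120b or of any particular card, and the model ignores compute time and transfer overhead.

```python
def seconds_per_token(reads):
    """reads: (gigabytes_read_per_token, bandwidth_gb_s) pairs, one per memory tier."""
    return sum(gigabytes / bandwidth for gigabytes, bandwidth in reads)


# Hypothetical split: attention and shared weights stay in fast VRAM,
# while the experts picked for each token are read from slower system RAM.
vram_reads = (1.0, 1000.0)       # placeholder: 1 GB per token at 1,000 GB/s
system_ram_reads = (2.0, 80.0)   # placeholder: 2 GB per token at 80 GB/s

tokens_per_second = 1.0 / seconds_per_token([vram_reads, system_ram_reads])
print(f"about {tokens_per_second:.0f} tokens per second")
```

The slow tier dominates the total, but because only the active experts are read for each token, the amount pulled from system RAM stays small enough to remain usable. A dense model offloaded the same way would have to read all of its offloaded weights on every token.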

Unfortunately, MoE models do have compromises. A 235B-A22B model is typically smarter than a 22B dense one thanks to all that stored knowledge, but whether it matches a true 235B dense model is less clear-cut than it might seem, and it also appears to depend on the task. At equal total parameters, MoEs tend to hold their own on knowledge and memorization-heavy work, but research has found they can lag dense models on reasoning, which appears to benefit more from parameters applied per token than from sheer stored capacity. The mechanism is reasonably well understood: because routing sends each expert only a subset of tokens, individual experts get restricted exposure to the training distribution, which can constrain their generalization unless the model actively spreads knowledge across the expert pool through load balancing, shared experts, or mutual distillation between experts.

Whether an optimized MoE can be pushed to match or beat a dense model at strictly equal total parameters, compute, and data is an active research question rather than a settled result at present. There’s also a routing step that costs you a little per token, and the irregular token-by-token routing hurts memory locality in a way that a single user (where you can’t amortize weight reads across a big batch) will feel more than a server would.

What an MoE model gives you is the best opportunity to actually fit and run a large model in a usable fashion on your own hardware, and you’re paying a small per-active-parameter penalty for the privilege of holding a huge model in memory that your machine can actually use quickly. Given that the alternative is not running the model at all, I’ll take that trade every time.

The bigger weak spot is prefill, and MoE models don’t really help here. Prompt processing is compute-bound, and these machines aren’t exactly compute monsters. Apple Silicon in particular has no tensor cores, so feeding it a long context and waiting for the first token to appear can be painful compared to a CUDA GPU doing the same work. If your workflow is short prompts and long generations, you’ll barely notice, but providing a huge amount of context will considerably slow down your responses. Nvidia’s GB10 narrows the gap here as it brings tensor cores and CUDA to unified memory, but discrete GPUs are still faster overall.
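For a feel of how compute limits the wait before the first token, here is a rough estimate. It leans on the common rule of thumb of about 2 floating-point operations per parameter applied per token, ignores the attention cost that grows with context length, and uses made-up throughput figures. None of the numbers below are measurements of any machine mentioned here, and the function name is hypothetical.

```python
def prefill_seconds(prompt_tokens: int,
                    params_per_token_billions: float,
                    effective_tflops: float) -> float:
    """Compute-bound estimate of the time to process a prompt."""
    # Rule of thumb: ~2 FLOPs per parameter applied, per prompt token.
    flops = 2.0 * params_per_token_billions * 1e9 * prompt_tokens
    return flops / (effective_tflops * 1e12)


# Hypothetical: a 32,000-token prompt, 37B parameters applied per token,
# on two machines with placeholder sustained throughputs.
slow = prefill_seconds(32_000, 37, effective_tflops=50)
fast = prefill_seconds(32_000, 37, effective_tflops=500)
print(f"slower machine: ~{slow:.0f} s, faster machine: ~{fast:.0f} s")
```

The estimate scales linearly with prompt length and inversely with compute, so a tenfold compute gap turns a few seconds of waiting into most of a minute on long contexts.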

For anything that fits comfortably inside its VRAM, a discrete GPU is still the faster, simpler choice. However, what matters for local AI now isn’t how fast your memory is, but rather whether the model fits and runs at all. In that regard, high-VRAM GPUs aren’t really the answer, as the options with the most VRAM have largely stayed the same. Meanwhile, unified memory and MoE models have grown at the same time, and it’s clear that bigger models are the future.

With the 32GB VRAM ceiling seemingly not going anywhere anytime soon, unified memory machines are the true workhorses for local AI. And if bandwidth remains the primary constraint, then the best way to deal with that is to use a mixture of experts model. There are plenty of great options out there, and what you lose in intelligence you gain back in usability.