I expect that in 2026 more specialized small-to-medium-sized (20-50B parameter) language models will become a big deal, along with tooling to easily manage and leverage a quiver of such models for different workflows (probably agentic or assisted coding first, then others).
In some quick tests with Mistral’s devstral-small-2 (Devstral Small 2 - Mistral AI | Mistral Docs), I’ve found it isn’t quite as good as the SOTA (state-of-the-art) massive models like Claude Sonnet 4.5, but with some hand-holding it can do a good job. It’s also significantly cheaper and faster than SOTA models.
I also expect that some company will start selling 64-128GB inference hardware for around $2500 USD that, when running small-to-medium-sized models (like devstral-small-2), can perform prompt processing and token generation at speeds rivaling current SOTA cloud API performance. The current Strix Halo and DGX Spark boxes are neat, but they can’t run inference on these medium-sized models fast enough. NVIDIA’s RTX Pro 6000 has 96GB of RAM and can run inference for medium-sized models quite well, but it currently costs upwards of $9000 USD. There’s a market opportunity here, and I hope a few very capable companies take aim at it.
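For a rough sense of why 64-128GB is the interesting range, a model’s weight footprint is approximately parameter count times bytes per parameter (this is my own back-of-envelope math, not a vendor spec, and it ignores the extra headroom needed for KV cache and activations):

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory needed just for a model's weights.

    params_billion: parameter count in billions (e.g. 50 for a "medium" model)
    bits_per_param: quantization level (16 = fp16/bf16, 8 = int8, 4 = 4-bit)
    """
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

# A 50B model at different quantization levels (weights only):
for bits in (16, 8, 4):
    print(f"50B @ {bits}-bit: ~{weight_footprint_gb(50, bits):.0f} GB")
```

So a 50B model quantized to 4 or 8 bits fits comfortably in a 64GB box with room left for context, which is exactly the class of hardware I’m hoping someone builds.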
The combination of these two things will be a very big deal for personal local LLM hosting and I expect this will also induce a rather large contraction in the planned data center build-out which has dominated tech news lately. A contraction in planned data center build-out should also resolve the RAM pricing situation, although such a contraction is likely to have financial ripple effects for memory fabricators, so this might take longer than this year to happen.
I’m excited to watch the AI space this year! I think some very interesting things will happen.
Alternatively, I’d love to see a subscription cloud inference company focused on just smaller (<50B parameter) models. Small models mean super fast inference (>200 tokens/sec) when run on even previous generation datacenter grade hardware. Older generation GPUs are going to be cheaper to rent/run/own because everyone is chasing the latest generation hardware.
A big advantage of small models is that they’re easy and fast to swap out: moving a 30B parameter model from CPU RAM into GPU VRAM is probably a 1-2 second operation. So with a typical GPU inference server, it’s quick and easy to shuffle the ratio of models that are hot in the GPUs as demand changes, and even less popular models can quickly “swap in” when customers want to keep using them, so model deprecation becomes less of a concern.
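The 1-2 second figure is plausible from simple bandwidth math (a sketch with assumed numbers, not a measurement): a 30B model quantized to 4 bits is roughly 15 GB of weights, and a PCIe 4.0 x16 link tops out around 32 GB/s in theory:

```python
def swap_time_seconds(params_billion: float, bits_per_param: float,
                      link_gb_per_s: float = 32.0) -> float:
    """Estimate host-RAM -> VRAM copy time for a model's weights.

    link_gb_per_s: assumed PCIe 4.0 x16 theoretical bandwidth (~32 GB/s);
    real-world sustained throughput is typically somewhat lower.
    """
    size_gb = params_billion * bits_per_param / 8  # 1e9 params * (bits/8) bytes = GB
    return size_gb / link_gb_per_s

# 30B model at 4-bit: ~15 GB, about half a second at full link speed
print(f"{swap_time_seconds(30, 4):.2f} s")
```

Even allowing for real-world overheads, that keeps a cold model’s swap-in well under a couple of seconds.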
This isn’t as ideal as my local-first desire, but the financials of a monthly subscription might be a lot easier for a lot more people.
I’ve started working on an LLM tool for myself that will let me search through my daily notes plus my Claude Code sessions. The goal is to be able to effectively chat with my notes and past Claude work, because a lot of the time I have vague memories of things I’ve worked on before but can’t easily find where to get more information about exactly what I did.
I have been keeping daily notes about the work I’ve done since 2020 using zim (https://zim-wiki.org/). I really like zim, and because my notes are stored as plain text files, syncing them between my desktop and laptop Just Works with syncthing.
Today I had the bright idea that if I had my email in text format locally that I could also add this data source into my LLM chat tool! Then I could really tie together email conversations, my notes, and Claude Code sessions to get an even better picture of things I’ve worked on and more quickly find the source of truth in my own past experiences.
However, this kind of data, being used by an LLM, is starting to make me a bit paranoid when using a cloud based inference provider. As much as some of these cloud providers promise not to store or train on your prompts, having my personal detailed notes, Claude Code sessions, and especially email contents flying out onto the cloud and trusting a 3rd party to actually abide by their privacy policy feels quite risky and leaves me concerned…
So I’m back to considering how to get good-enough local inference using smaller models, especially for this kind of chat-with-my-files type use-case!
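As a first step toward that chat-with-my-files tool, plain keyword retrieval over the note files gets surprisingly far before any embeddings or local model enter the picture. A minimal sketch (the scoring and file layout here are my own invention, not any existing tool’s behavior):

```python
import os
import re

def score(query: str, text: str) -> int:
    """Count how many times the query's words appear in the text (case-insensitive)."""
    words = re.findall(r"\w+", query.lower())
    body = text.lower()
    return sum(body.count(w) for w in words)

def search_notes(root: str, query: str, top_n: int = 5):
    """Rank plain-text/markdown note files under `root` against `query`."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".txt", ".md")):  # zim uses .txt, Obsidian uses .md
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            s = score(query, text)
            if s:
                hits.append((s, path))
    hits.sort(reverse=True)
    return hits[:top_n]
```

The top few hits can then be pasted into a local model’s context as the retrieval half of a chat-with-notes loop.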
Neat, this is a powerful model. I recently started using Obsidian. It stores all notes in directories and markdown files, named after the title of the note, so it works very well with external tools. There are also lots of plugins that I’ve not started exploring yet.
I just tried this prompt in Claude Code:
I use thunderbird mail, can you summarize my discussions for the past week? Look in Archives and Sent, as well as all my customer folders.
Took a bit, but worked amazingly well. Claude wrote a temporary Python script, so I had Claude turn that into a skill.
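For anyone curious what such a script might look like: Thunderbird stores local folders in mbox format, which Python’s standard `mailbox` module reads directly. A rough sketch of the “past week” filter (a generic mbox reader under that assumption, not the script Claude actually wrote):

```python
import email.utils
import mailbox
import time

def recent_messages(mbox_path: str, days: int = 7, now: float = None):
    """List (date, sender, subject) for messages newer than `days` days
    in a single mbox file (the format Thunderbird uses for local folders)."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    results = []
    for msg in mailbox.mbox(mbox_path):
        parsed = email.utils.parsedate_tz(msg.get("Date", ""))
        if parsed is None:
            continue  # skip messages with missing or unparseable dates
        if email.utils.mktime_tz(parsed) >= cutoff:
            results.append((msg["Date"], msg.get("From", ""), msg.get("Subject", "")))
    return results
```

Running this over Archives, Sent, and each customer folder, then handing the resulting list to the model to summarize, matches the shape of the task in the prompt above.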
One of the problems with LLMs is you need to run them in a sandboxed env without network access unless you trust them, and this drastically limits their usefulness. @bradfa have you looked much into the security aspects yet? How can we be sure nothing is hidden in some of these “open” models?
I’ve also noticed in the OpenClaw docs and other information that they recommend the latest/largest models as being more secure and reliable (less likely to do insecure/stupid stuff).
Model choice matters: older/smaller/legacy models are significantly less robust against prompt injection and tool misuse. For tool-enabled agents, use the strongest latest-generation, instruction-hardened model available.
I don’t yet have enough compute capability locally to really leverage local LLMs the way I want to, but I have started thinking about this some. I have a standing todo for myself to set up a systemd-nspawn container to host local LLM inference, but I haven’t gotten around to implementing it yet.
I’m most interested in Mistral’s and AllenAI’s open weight models. As a westerner, models from the US or EU feel more trustworthy to me than the Chinese ones. I’m also very interested in AllenAI’s models because they publish a LOT of information about how they’re created.
But even if I don’t fully lock down local inference hosting in an airtight container, simply not sending my prompts to a 3rd party is a HUGE step towards retaining my data and improving my AI security.
I still have a tremendous amount to learn about all this. I think there are going to be great opportunities for smart people who understand the “full stack” of local AI. I’m not sure if I’ll be one of those people, but it sure is interesting to me right now.