I expect that in 2026, more specialized small-to-medium-sized (20-50B parameter) large language models will become a big deal, along with tooling to manage and leverage a quiver of such models easily across different workflows (probably agentic or assisted coding first, then others).
In some quick tests with Mistral’s devstral-small-2 (Devstral Small 2 - Mistral AI | Mistral Docs), I’ve found it isn’t quite as good as the SOTA (state-of-the-art) massive models like Claude Sonnet 4.5, but with some hand-holding it can do a good job. It’s also significantly cheaper and faster than the SOTA models.
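For a sense of what those quick tests look like, here’s a minimal sketch of sending a small coding task to a model like devstral-small-2 through an OpenAI-compatible endpoint. The base URL, API key, and model identifier below are assumptions; substitute whatever your local inference server (llama.cpp, vLLM, etc.) or hosted API actually exposes.

```python
# Minimal sketch: a "quick test" coding prompt against an OpenAI-compatible
# endpoint serving a small/medium model. Endpoint and model name are assumed,
# not prescribed -- adjust for your own setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local inference server
    api_key="not-needed-locally",         # placeholder; a hosted API needs a real key
)

response = client.chat.completions.create(
    model="devstral-small-2",  # assumed identifier; check your server's model list
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": (
            "Write a Python function that parses an ISO 8601 date string "
            "and returns a datetime, with a couple of unit tests."
        )},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
```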
I also expect that some company will start selling 64-128GB inference hardware for around $2500 USD that, when running small-to-medium-sized models (like devstral-small-2), can perform prompt processing and token generation at speeds that rival current SOTA cloud API performance. The current Strix Halo and DGX Spark boxes are neat, but they cannot run inference on these medium-sized models fast enough. NVIDIA’s RTX Pro 6000 has 96GB of RAM and can run inference for medium-sized models quite well, but it currently costs upwards of $9000 USD. There’s a market opportunity here, and I hope a few very capable companies take aim at it.
The combination of these two things will be a very big deal for personal local LLM hosting, and I expect it will also induce a rather large contraction in the planned data center build-out that has dominated tech news lately. That contraction should also resolve the RAM pricing situation, although it is likely to have financial ripple effects for memory fabricators, so this might take longer than a year to play out.
I’m excited to watch the AI space this year! I think some very interesting things will happen.