I expect that in 2026 more specialized small-to-medium-sized (20-50B parameter) language models will become a big deal, along with tooling to easily manage and leverage a quiver of such models for different workflows (probably agentic or assisted coding first, then others).
In some quick tests with Mistral’s devstral-small-2 (Devstral Small 2 - Mistral AI | Mistral Docs), I’ve found it isn’t quite as good as the SOTA (state-of-the-art) massive models like Claude Sonnet 4.5, but with some hand-holding it can do a good job. It’s also significantly cheaper and faster than SOTA models.
I also expect that some company will start selling 64-128GB inference hardware for around $2500 USD that, when running small-to-medium-sized models (like devstral-small-2), can perform prompt processing and token generation at speeds rivaling current SOTA cloud API performance. The current Strix Halo and DGX Spark boxes are neat, but they can’t run inference on these medium-sized models fast enough. NVIDIA’s RTX Pro 6000 has 96GB of RAM and can run inference for medium-sized models quite well, but it currently costs upwards of $9000 USD. There’s a market opportunity here, and I hope a few very capable companies take aim at it.
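For a rough sense of why 64-128GB is the interesting range, a model’s weight footprint is approximately parameter count times bytes per parameter (this is my own back-of-envelope math, not a vendor spec, and it ignores the extra headroom needed for KV cache and activations):

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory needed just for a model's weights.

    params_billion: parameter count in billions (e.g. 50 for a "medium" model)
    bits_per_param: quantization level (16 = fp16/bf16, 8 = int8, 4 = 4-bit)
    """
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

# A 50B model at different quantization levels (weights only):
for bits in (16, 8, 4):
    print(f"50B @ {bits}-bit: ~{weight_footprint_gb(50, bits):.0f} GB")
```

So a 50B model quantized to 4 or 8 bits fits comfortably in a 64GB box with room left for context, which is exactly the class of hardware I’m hoping someone builds.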
The combination of these two things will be a very big deal for personal local LLM hosting and I expect this will also induce a rather large contraction in the planned data center build-out which has dominated tech news lately. A contraction in planned data center build-out should also resolve the RAM pricing situation, although such a contraction is likely to have financial ripple effects for memory fabricators, so this might take longer than this year to happen.
I’m excited to watch the AI space this year! I think some very interesting things will happen.
Alternatively, I’d love to see a subscription cloud inference company focused on just smaller (<50B parameter) models. Small models mean super fast inference (>200 tokens/sec) when run on even previous generation datacenter grade hardware. Older generation GPUs are going to be cheaper to rent/run/own because everyone is chasing the latest generation hardware.
A big advantage of small models is that they’re easy and fast to swap out: moving a 30B parameter model from CPU RAM into GPU VRAM is probably a 1-2 second operation. So with a typical GPU inference server, it’s quick and easy to shuffle the ratio of models that are hot in the GPUs as demand changes, and even less popular models can quickly “swap in” when customers want to keep using them, so model deprecation becomes less of a concern.
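The 1-2 second figure is plausible from simple bandwidth math (a sketch with assumed numbers, not a measurement): a 30B model quantized to 4 bits is roughly 15 GB of weights, and a PCIe 4.0 x16 link tops out around 32 GB/s in theory:

```python
def swap_time_seconds(params_billion: float, bits_per_param: float,
                      link_gb_per_s: float = 32.0) -> float:
    """Estimate host-RAM -> VRAM copy time for a model's weights.

    link_gb_per_s: assumed PCIe 4.0 x16 theoretical bandwidth (~32 GB/s);
    real-world sustained throughput is typically somewhat lower.
    """
    size_gb = params_billion * bits_per_param / 8  # 1e9 params * (bits/8) bytes = GB
    return size_gb / link_gb_per_s

# 30B model at 4-bit: ~15 GB, about half a second at full link speed
print(f"{swap_time_seconds(30, 4):.2f} s")
```

Even allowing for real-world overheads, that keeps a cold model’s swap-in well under a couple of seconds.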
This isn’t as ideal as my local-first desire, but the financials of a monthly subscription might be a lot easier for a lot more people.
I’ve started working on an LLM tool for myself that will let me search through my daily notes plus my Claude Code sessions. The goal is to be able to effectively chat with my notes and past Claude work, because a lot of the time I have vague memories of things I’ve worked on before but can’t easily find where to get more information about exactly what I did.
I have been keeping daily notes about the work I’ve done since 2020 using zim (https://zim-wiki.org/). I really like zim, and because my notes are stored as plain text files, syncing them between my desktop and laptop Just Works with syncthing.
Today I had the bright idea that if I had my email in text format locally that I could also add this data source into my LLM chat tool! Then I could really tie together email conversations, my notes, and Claude Code sessions to get an even better picture of things I’ve worked on and more quickly find the source of truth in my own past experiences.
However, this kind of data, being used by an LLM, is starting to make me a bit paranoid when using a cloud based inference provider. As much as some of these cloud providers promise not to store or train on your prompts, having my personal detailed notes, Claude Code sessions, and especially email contents flying out onto the cloud and trusting a 3rd party to actually abide by their privacy policy feels quite risky and leaves me concerned…
So I’m back to considering how to get good-enough local inference using smaller models, especially for this kind of chat-with-my-files type use-case!
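As a first step toward that chat-with-my-files tool, plain keyword retrieval over the note files gets surprisingly far before any embeddings or local model enter the picture. A minimal sketch (the scoring and file layout here are my own invention, not any existing tool’s behavior):

```python
import os
import re

def score(query: str, text: str) -> int:
    """Count how many times the query's words appear in the text (case-insensitive)."""
    words = re.findall(r"\w+", query.lower())
    body = text.lower()
    return sum(body.count(w) for w in words)

def search_notes(root: str, query: str, top_n: int = 5):
    """Rank plain-text/markdown note files under `root` against `query`."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".txt", ".md")):  # zim uses .txt, Obsidian uses .md
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            s = score(query, text)
            if s:
                hits.append((s, path))
    hits.sort(reverse=True)
    return hits[:top_n]
```

The top few hits can then be pasted into a local model’s context as the retrieval half of a chat-with-notes loop.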
Neat, this is a powerful model. I recently started using Obsidian. It stores all notes in directories and markdown files, named after the title of the note, so it works very well with external tools. There are also lots of plugins that I’ve not started exploring yet.
I just tried this prompt in Claude Code:
I use thunderbird mail, can you summarize my discussions for the past week? Look in Archives and Sent, as well as all my customer folders.
Took a bit, but worked amazingly well. Claude wrote a temporary Python script, so I had Claude turn that into a skill.
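For anyone curious what such a script might look like: Thunderbird stores local folders in mbox format, which Python’s standard `mailbox` module reads directly. A rough sketch of the “past week” filter (a generic mbox reader under that assumption, not the script Claude actually wrote):

```python
import email.utils
import mailbox
import time

def recent_messages(mbox_path: str, days: int = 7, now: float = None):
    """List (date, sender, subject) for messages newer than `days` days
    in a single mbox file (the format Thunderbird uses for local folders)."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    results = []
    for msg in mailbox.mbox(mbox_path):
        parsed = email.utils.parsedate_tz(msg.get("Date", ""))
        if parsed is None:
            continue  # skip messages with missing or unparseable dates
        if email.utils.mktime_tz(parsed) >= cutoff:
            results.append((msg["Date"], msg.get("From", ""), msg.get("Subject", "")))
    return results
```

Running this over Archives, Sent, and each customer folder, then handing the resulting list to the model to summarize, matches the shape of the task in the prompt above.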
One of the problems with LLMs is you need to run them in a sandboxed env without network access unless you trust them, and this drastically limits their usefulness. @bradfa have you looked much into the security aspects yet? How can we be sure nothing is hidden in some of these “open” models?
I’ve also noticed in the OpenClaw docs and other information that they recommend the latest/largest models as being more secure and reliable (less likely to do insecure/stupid stuff).
Model choice matters: older/smaller/legacy models are significantly less robust against prompt injection and tool misuse. For tool-enabled agents, use the strongest latest-generation, instruction-hardened model available.
I don’t yet have enough compute capability locally to really leverage local LLMs the way I want to, but I have started thinking about this some. I have a standing todo for myself to set up a systemd-nspawn container to host local LLM inference, but I haven’t gotten around to implementing it yet.
I’m most interested in Mistral’s and AllenAI’s open weight models. As a westerner, models from the US or EU feel more trustworthy to me than the Chinese ones. I’m also very interested in AllenAI’s models because they publish a LOT of information about how they’re created.
But even if I don’t fully lock down local inference hosting in an airtight container, simply not sending my prompts to a 3rd party is a HUGE step towards retaining my data and improving my AI security.
I still have a tremendous amount to learn about all this. I think there are going to be great opportunities for smart people who understand the “full stack” of local AI. I’m not sure if I’ll be one of those people, but it sure is interesting to me right now.