How Modern LLMs Learn New Skills Without Forgetting Old Ones

The big shift in 2026 is that AI models can pick up new capabilities without being retrained from scratch. Here's how that actually works.

Updated 14 June 2026

By Rob · 14 June 2026 · 4 min read

For a long time, the ways AI models 'learned' something new were crude: retrain the whole model from scratch on more data, or keep training the existing model on the new material. The first is expensive and slow. The second is risky - the new training often degraded skills the model had previously been good at, an effect researchers call 'catastrophic forgetting'.

In 2026, the technical picture is more interesting. Models are increasingly built as collections of specialised modules that can be added, removed or updated independently. The headline implication for everyday users: AI is going to keep getting better in specific areas without the periodic 'whole-product reset' feeling we got with each big model release in 2023-2024.

Mixture-of-experts in plain terms

The architecture behind most frontier models in 2026 is 'mixture-of-experts' (MoE). The idea is straightforward:

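Instead of one enormous neural network that processes every query, the model has many smaller expert networks, each of which ends up specialising in different kinds of input during training.
When you ask a question, a 'router' scores the experts for each small piece of text (each token) and activates only the few that score highest.
As a simplified picture: for a coding question, the coding experts activate. For a medical question, the medical experts activate. For chit-chat, a smaller, faster expert handles it. In real models the division of labour is learned rather than assigned, so it is usually fuzzier than these labels suggest.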
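The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any provider's implementation: the names `route` and `moe_layer`, the four scaling "experts" and the hand-written router weights are all made up for the example, and real routers are learned layers inside a much larger network.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, router_weights, top_k=2):
    """Return (expert_index, gate_weight) pairs for the top_k experts.

    router_weights holds one weight vector per expert (hypothetical values);
    an expert's score is the dot product of its vector with the token vector.
    """
    scores = [sum(w * x for w, x in zip(weights, token)) for weights in router_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    # Renormalise so the gate weights of the chosen experts sum to 1.
    return [(i, probs[i] / kept) for i in chosen]

def moe_layer(token, router_weights, experts, top_k=2):
    """Run only the chosen experts and blend their outputs by gate weight."""
    output = [0.0] * len(token)
    for index, gate in route(token, router_weights, top_k):
        expert_out = experts[index](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output

# Four toy "experts": each just scales its input by a different factor.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.0, -1.0],
]

token = [2.0, 1.0]
selected = route(token, router_weights, top_k=2)
result = moe_layer(token, router_weights, experts, top_k=2)
print(selected)  # experts 0 and 1 are chosen; experts 2 and 3 never run
print(result)
```

The point of the sketch is the skipped work: with `top_k=2`, two of the four experts are never called for this token, which is where the compute saving comes from.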

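This is why an MoE model can be much bigger in total parameter count without being correspondingly slower or more expensive to run: most of the model is asleep on any given query. It is the commonly assumed explanation for how GPT-5 and Claude 4.7 can be far larger than GPT-4, although neither OpenAI nor Anthropic publishes its architecture.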
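To put rough numbers on the gap between total and active parameters, here is a back-of-the-envelope calculation. Every figure in it is hypothetical and chosen for round arithmetic. None describes a real model, and the function `moe_param_counts` is an illustrative helper that ignores how experts are spread across layers.

```python
def moe_param_counts(shared, per_expert, num_experts, top_k):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared      - parameters every query uses (attention, embeddings, router)
    per_expert  - parameters in one expert
    num_experts - experts stored in the model
    top_k       - experts actually run per token
    """
    total = shared + per_expert * num_experts
    active = shared + per_expert * top_k
    return total, active

# Hypothetical model: 10B shared parameters, 64 experts of 5B each, 2 active.
total, active = moe_param_counts(
    shared=10_000_000_000,
    per_expert=5_000_000_000,
    num_experts=64,
    top_k=2,
)
active_fraction = active / total

print(total)                      # 330000000000 parameters stored
print(active)                     # 20000000000 parameters used per token
print(round(active_fraction, 3))  # 0.061, about 6% of the model
```

In this made-up configuration the model stores 330 billion parameters but does the per-token work of a 20 billion parameter one, which is the sense in which size and running cost come apart.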

Why modular matters

The practical advantage isn't speed (though MoE helps there). It's that you can train a NEW expert separately and plug it in, without having to retrain everything else.

AllenAI's published work in 2026 demonstrates this clearly with their open-source models: they trained domain experts (legal reasoning, scientific paper comprehension, multilingual translation) as independent training runs, then combined them into a single MoE model. The combined model gained the new capabilities without measurably losing existing skills.
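The 'train it separately, then plug it in' idea can be illustrated with a toy model. This is a sketch of the general principle only, not AllenAI's method: the class `TinyMoE`, its `add_expert` method and the hand-picked router rows are hypothetical. In a real system the router usually has to learn when to use a new expert, rather than being handed a weight vector.

```python
import math

class TinyMoE:
    """Toy single-layer MoE: one router row and one function per expert."""

    def __init__(self):
        self.router_rows = []  # one weight vector per expert
        self.experts = []      # one callable per expert

    def add_expert(self, router_row, expert_fn):
        """Plug in an expert without touching the ones already there."""
        self.router_rows.append(list(router_row))
        self.experts.append(expert_fn)

    def forward(self, token, top_k=1):
        scores = [sum(w * x for w, x in zip(row, token)) for row in self.router_rows]
        chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        # Softmax over the chosen experts only, so unchosen experts have no effect.
        peak = max(scores[i] for i in chosen)
        exps = {i: math.exp(scores[i] - peak) for i in chosen}
        norm = sum(exps.values())
        output = [0.0] * len(token)
        for i in chosen:
            gate = exps[i] / norm
            output = [o + gate * y for o, y in zip(output, self.experts[i](token))]
        return output, chosen

model = TinyMoE()
model.add_expert([1.0, 0.0, 0.0], lambda v: [2.0 * x for x in v])  # existing expert A
model.add_expert([0.0, 1.0, 0.0], lambda v: [x + 1.0 for x in v])  # existing expert B

old_token = [3.0, 1.0, 0.0]
before, _ = model.forward(old_token)

# A new expert, imagined as trained elsewhere, that responds to the third feature.
model.add_expert([0.0, 0.0, 1.0], lambda v: [-x for x in v])

after, _ = model.forward(old_token)
new_token = [0.0, 0.5, 4.0]
new_out, new_chosen = model.forward(new_token)

print(before == after)  # True: the old input is handled exactly as before
print(new_chosen)       # [2]: the new input is sent to the new expert
```

Because the existing experts and their router rows are never modified, inputs that still route to them produce the same output after the addition. That is the toy version of gaining a capability without losing existing skills.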

For frontier proprietary models (OpenAI, Anthropic, Google), the same architectural principle is at work even though the specifics are kept private. The pace of capability addition has accelerated in 2026 partly because of this - new tool-use, new reasoning modes, new language coverage land as point releases rather than waiting for the next big model version.

What it changes for everyday use

A few practical consequences for a normal AI user:

Capability improvements happen continuously. The era of 'wait 12 months for GPT-5' is over. Model providers ship new specialist abilities as point releases - improved coding in a Tuesday update, better long-document handling the following week. Worth keeping an eye on release notes if you care about a specific use case.
The 'best model for X' question gets sharper. Claude is genuinely better at long-context reasoning and coding. Gemini is better at retrieving structured information from Google's ecosystem. GPT-5 is better at agentic multi-step tasks. These differences aren't just marketing positioning - they plausibly reflect which domain experts each provider has invested most in training, though none of the providers publishes those details.
Open-source models close the gap faster. Meta's Llama 4, Mistral's Pixtral 3 and Alibaba's Qwen 3 are all MoE-based and benefit from the same point-release acceleration. The open-source frontier is roughly 6-12 months behind the proprietary one in 2026, vs. 18-24 months in 2024.

Where this leaves training-from-scratch

Training a frontier model from scratch is still extremely expensive (hundreds of millions of dollars in compute), and still gates the underlying intelligence level. What's changed is the post-training step - the additional training done after the base model is built. Post-training used to be a single monolithic run; in 2026 it's increasingly modular, parallel, and additive.

For everyone NOT running a frontier lab, the practical implication is that you don't need to retrain a model to make it good at YOUR domain - you can fine-tune a small adapter (LoRA, QLoRA) on your data and slot it on top of an open-source MoE base. The cost has dropped to the point where small teams can do this for under £500 in compute.
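For readers curious about what a 'small adapter' is, the core of LoRA fits in a short sketch: the base weights stay frozen, and a pair of much smaller matrices learns a low-rank correction that is added on top. The class `LoRALinear`, the 64x64 layer size and the rank of 2 below are illustrative assumptions, and the sketch covers only the forward pass, not training. QLoRA additionally stores the frozen base in 4-bit precision, which is not shown.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A."""

    def __init__(self, base_weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(base_weight), len(base_weight[0])
        self.base_weight = base_weight  # frozen: never updated during fine-tuning
        self.scale = alpha / rank
        # A starts as small random values, B starts at zero, so the adapter
        # initially leaves the base model's behaviour unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.base_weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)

# Hypothetical 64x64 layer (an identity matrix, for simplicity) with a rank-2 adapter.
size = 64
base = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
layer = LoRALinear(base, rank=2)

x = [float(i) for i in range(size)]
base_params = size * size
adapter_params = layer.trainable_params()

print(layer.forward(x) == matvec(base, x))  # True before any training
print(base_params, adapter_params)          # 4096 256
```

Here the adapter has 256 trainable values against 4,096 frozen ones. The ratio becomes far more lopsided at real model sizes, which is a large part of why adapter fine-tuning is so much cheaper than retraining.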

The bottom line

You don't need to know how mixture-of-experts works to use AI well. But it helps explain a few things you'll notice in 2026: why models keep getting better in specific narrow ways without big version-number jumps, why open-source models keep closing the gap, and why providers can afford to ship faster improvements without periodic 'we trained a new model' announcements.

The architecture has moved from 'one big network' to 'a collection of specialised modules orchestrated by a router'. Each module can improve independently. That's the engineering shift that makes the 2026 AI experience feel like continuous improvement rather than discrete release cycles.