15 May 2025
If you asked someone in 2018 what a "small model" was, they'd probably say something with a few million parameters that ran on a Raspberry Pi or your phone. Fast-forward to today, and we're calling 30B parameter models "small"—because they only need one GPU to run.
So yeah, the definition of "small" has changed.
Back in the early days of machine learning, a "small model" might've been a decision tree or a basic neural net that could run on a laptop CPU. Think scikit-learn, not LLMs.
Then came transformers and large language models (LLMs). As these got bigger and better, anything not requiring a cluster of A100s suddenly started to feel... small by comparison.
Today, "small" is more about how deployable a model is than about its size on paper.
We now have two main flavors of small language models:
On-device models: the kind you can run on mobile devices or edge hardware. They're optimized for speed, low memory, and offline use.
Single-GPU models: these still require a GPU, but just one GPU, not a whole rack. In this category, even 30B or 70B models can qualify as "small".
The fact that you can now run a 70B model on a single 4090 and get decent throughput? That would've been science fiction a few years ago.
One big strength of small models is that they don't need to do everything. Unlike GPT-4 or Claude, which try to be general-purpose brains, small models are often narrow and optimized.
That gives them a few key advantages: lower cost, faster inference, a smaller memory footprint, and easier fine-tuning for a specific task.
Small models shine when you know what you want. Think: summarizing medical records, identifying security vulnerabilities, parsing invoices—stuff that doesn't need general reasoning across the internet.
It sounds weird, but yes: the bar for what's considered "small" keeps shifting.
With the right quantization and engineering, even a 70B model can run comfortably on a high-end consumer GPU.
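To see why quantization makes this possible, a back-of-the-envelope memory estimate helps. This sketch only counts the weights themselves (it ignores activations, the KV cache, and runtime overhead, which add more on top):

```python
def model_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough VRAM needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes).

    Ignores activations, KV cache, and framework overhead.
    """
    return num_params * bits_per_param / 8 / 1e9

params_70b = 70e9  # a 70B-parameter model
for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {model_memory_gb(params_70b, bits):.0f} GB")
# fp16: 140 GB
# int8: 70 GB
# int4: 35 GB
```

At fp16, a 70B model needs roughly 140 GB just for weights, far beyond any single consumer card. Quantizing to 4 bits cuts that to about 35 GB, which is why "one GPU" (with a large-VRAM card, or with partial CPU offloading on a 24 GB card) suddenly becomes plausible.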
So now we talk about models being "small" if they're cheap to run, easy to deploy, and manageable on a single GPU. It's less about raw size, more about practicality.
Not all small models are new. Some of the most widely used models today have been around for years, quietly powering everyday tools we rely on.
Google Translate: Since 2006, it's been translating billions of words daily. In 2016, Google switched to a neural machine translation system, GNMT, which uses an encoder-decoder architecture with long short-term memory (LSTM) layers and attention mechanisms. This system, with over 160 million parameters, significantly improved translation fluency and accuracy.
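The attention mechanism mentioned above lets the decoder weigh the encoder's states when producing each output word. It can be sketched in a few lines of plain Python; this is a simplified dot-product variant for illustration, not GNMT's actual (additive, LSTM-based) attention:

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Dot-product attention.

    Scores each encoder state against the decoder's query vector,
    normalizes the scores into weights, and returns the weighted sum
    of encoder states (the "context vector") plus the weights.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights

# Toy example: three 2-dimensional encoder states, one decoder query.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
context, weights = attend(query, states)
```

The decoder repeats this at every output step, so each translated word can "look back" at the most relevant source words rather than relying on a single fixed summary of the sentence.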
AWS Textract: This service extracts text and data from scanned documents. It's been a staple in automating document processing workflows, handling everything from invoices to medical records.
These models may not be cutting-edge by today's standards, but they've been instrumental in shaping the AI landscape and continue to serve millions daily.
Small models are becoming a huge deal.
And when a "small model" can hold its own against GPT-3.5 in benchmarks? The game has officially changed.
In a world chasing ever-bigger models, small ones are quietly doing more with less—and that's exactly what makes them powerful.
Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!