15 May 2025
If you asked someone in 2018 what a "small model" was, they'd probably say something with a few million parameters that ran on a Raspberry Pi or your phone. Fast-forward to today, and we're calling 30B parameter models "small"—because they only need one GPU to run.
So yeah, the definition of "small" has changed.
Back in the early days of machine learning, a "small model" might've been a decision tree or a basic neural net that could run on a laptop CPU. Think scikit-learn, not LLMs.
Then came transformers and large language models (LLMs). As these got bigger and better, anything not requiring a cluster of A100s suddenly started to feel... small by comparison.
Today, "small" is more about how deployable a model is than about its size on paper.
We now have two main flavors of small language models:
On-device models: the kind you can run on mobile devices or edge hardware. They're optimized for speed, low memory, and offline use.
Single-GPU models: these still require a GPU, but just one GPU, not a whole rack. In this category, even 30B or 70B models can qualify as "small".
The fact that you can now run a 70B model on a single 4090 and get decent throughput? That would've been science fiction a few years ago.
One big strength of small models is that they don't need to do everything. Unlike GPT-4 or Claude, which try to be general-purpose brains, small models are often narrow and optimized.
That gives them a few key advantages: lower cost, faster inference, a smaller memory footprint, and easier fine-tuning for a specific task.
Small models shine when you know what you want. Think: summarizing medical records, identifying security vulnerabilities, parsing invoices—stuff that doesn't need general reasoning across the internet.
It sounds weird, but yes: the bar for what's considered "small" keeps shifting.
With the right quantization and engineering, even a 70B model can run comfortably on a high-end consumer GPU.
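To see why quantization makes this possible, a back-of-the-envelope memory estimate helps. This sketch only counts the weights themselves (it ignores activations, the KV cache, and runtime overhead, which add more on top):

```python
def model_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough VRAM needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes).

    Ignores activations, KV cache, and framework overhead.
    """
    return num_params * bits_per_param / 8 / 1e9

params_70b = 70e9  # a 70B-parameter model
for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {model_memory_gb(params_70b, bits):.0f} GB")
# fp16: 140 GB
# int8: 70 GB
# int4: 35 GB
```

At fp16, a 70B model needs roughly 140 GB just for weights, far beyond any single consumer card. Quantizing to 4 bits cuts that to about 35 GB, which is why "one GPU" (with a large-VRAM card, or with partial CPU offloading on a 24 GB card) suddenly becomes plausible.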
So now we talk about models being "small" if they're cheap to run, easy to deploy, and manageable on a single GPU. It's less about raw size, more about practicality.
Not all small models are new. Some of the most widely used models today have been around for years, quietly powering everyday tools we rely on.
Google Translate: Since 2006, it's been translating billions of words daily. In 2016, Google switched to a neural machine translation system, GNMT, which uses an encoder-decoder architecture with long short-term memory (LSTM) layers and attention mechanisms. This system, with over 160 million parameters, significantly improved translation fluency and accuracy.
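The attention mechanism mentioned above lets the decoder weigh the encoder's states when producing each output word. It can be sketched in a few lines of plain Python; this is a simplified dot-product variant for illustration, not GNMT's actual (additive, LSTM-based) attention:

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Dot-product attention.

    Scores each encoder state against the decoder's query vector,
    normalizes the scores into weights, and returns the weighted sum
    of encoder states (the "context vector") plus the weights.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights

# Toy example: three 2-dimensional encoder states, one decoder query.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
context, weights = attend(query, states)
```

The decoder repeats this at every output step, so each translated word can "look back" at the most relevant source words rather than relying on a single fixed summary of the sentence.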
AWS Textract: This service extracts text and data from scanned documents. It's been a staple in automating document processing workflows, handling everything from invoices to medical records.
These models may not be cutting-edge by today's standards, but they've been instrumental in shaping the AI landscape and continue to serve millions daily.
Small models are becoming a huge deal.
And when a "small model" can hold its own against GPT-3.5 in benchmarks? The game has officially changed.
In a world chasing ever-bigger models, small ones are quietly doing more with less—and that's exactly what makes them powerful.
Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!