A Practical Zero-Shot Multimodal Classifier

16 Sep 2025

Harsha V Khurdula

Most classifiers ship with a fixed label list. They perform well inside that box and fail silently outside it. We took that as a challenge and built a classifier that starts from the opposite assumption: the world is open. New product names appear, new categories form, and inputs arrive in multiple languages.

At JigsawStack, we treat classification as a language task with an open vocabulary. The classifier accepts text or images along with arbitrary label strings, including long and descriptive ones, and is built on top of Small Vision Language Models. The result is practical zero-shot classification that works across inputs ranging from multilingual support tickets to real-world images, without retraining.
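To make "classification as a language task" concrete, here is a minimal sketch of the general idea, not JigsawStack's actual implementation. The prompt wording, the `build_prompt` and `classify` helpers, and the `generate` callable (a stand-in for any text-generating model) are all assumptions made for illustration.

```python
import re
from typing import Callable, Sequence


def build_prompt(text: str, labels: Sequence[str]) -> str:
    """Render the input and candidate labels as one instruction (assumed format)."""
    numbered = "\n".join(f"{i}. {label}" for i, label in enumerate(labels, start=1))
    return (
        "Pick the single best label for the input.\n"
        f"Input: {text}\n"
        f"Labels:\n{numbered}\n"
        "Answer with the label number only."
    )


def classify(text: str, labels: Sequence[str], generate: Callable[[str], str]) -> str:
    """Ask a model (hypothetical `generate`) for a label number and map it back."""
    if not labels:
        raise ValueError("at least one label is required")
    reply = generate(build_prompt(text, labels))
    match = re.search(r"\d+", reply)  # first integer in the model's reply
    if match is None:
        raise ValueError(f"no label number found in reply: {reply!r}")
    index = int(match.group())
    if not 1 <= index <= len(labels):
        raise ValueError(f"label number out of range: {index}")
    return labels[index - 1]


# Stub model for demonstration only; a real model call would go here.
labels = ["billing issue", "bug report", "feature request"]
print(classify("La app se cierra al abrirla", labels, lambda prompt: "2"))
```

Because the labels are only text inside the prompt, changing the label set means changing a list, not retraining a model.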

Where It Slaps 🔥

1) Crosslingual Text Classification

...

Result

...

2) Moderation & Intent Classification

...

Result

...

3) Tagging (Multi-label) Classification

...

Result

...

4) IRL/Context-Aware Image Classification

Input image:

Classification:

...

Result

...

5) Image Classification with Image as a Label Reference

Input Image:

Reference images as labels:

The two reference images above (shown side by side) are passed as input, and each shows a poker hand. The input image shows a player holding a ‘Royal Flush’, the best and rarest hand in poker: the ace, king, queen, jack, and ten of the same suit. The task is to classify the player's hand against the reference images, as follows:

...

Result

...

Not Just Novel, but Smart

Open-set safety: you can pass a label such as “unknown/none of the above” so the classifier is not forced into a bad match, which helps avoid false positives.
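Label agility: labels are arbitrary strings rather than a trained-in list, so you can add, remove, or reword them without retraining. Built with care for real-world generalization.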
Multilingual by construction: prompts and labels can be in the user’s language; we support mixed scripts and cross-lingual matching.
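The open-set point can be sketched on the client side as follows. This is an illustrative assumption, not part of any JigsawStack SDK: `with_fallback` and `resolve` are hypothetical helpers, and the only detail taken from the text above is the fallback label string itself.

```python
from typing import List, Sequence

FALLBACK = "unknown/none of the above"


def with_fallback(labels: Sequence[str], fallback: str = FALLBACK) -> List[str]:
    """Return de-duplicated, non-empty labels with the fallback label appended."""
    cleaned = list(dict.fromkeys(label.strip() for label in labels if label.strip()))
    if fallback not in cleaned:
        cleaned.append(fallback)
    return cleaned


def resolve(predicted: str, labels: Sequence[str], fallback: str = FALLBACK) -> str:
    """Map any prediction outside the label list to the fallback label."""
    return predicted if predicted in labels else fallback


labels = with_fallback(["cat", "dog", "cat"])
print(labels)                      # candidate labels to send to a classifier
print(resolve("hamster", labels))  # an off-list answer becomes the fallback
```

Keeping the fallback as an ordinary label means downstream code can treat "no good match" like any other class instead of handling a special error path.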

Zero-shot multimodal classification isn’t magic, and it isn’t something you achieve by training on every label in the world; it’s language-conditioned decision-making with honest uncertainty. By treating labels as language and leveraging SLMs, we get a classifier that generalizes to new labels without a retraining cycle.

👥 Join the JigsawStack Community

Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!
