Most classifiers ship with a fixed label list. They perform well inside that box and fail silently outside it. We took this as a challenge and built a classifier that starts from the opposite assumption: the world is open. New product names appear, new categories form, and inputs arrive in multiple languages.
At JigsawStack, we treat classification as a language task and the label set as an open vocabulary. The classifier accepts text or images together with arbitrary label strings, including long and descriptive ones. It is built on top of small vision language models; the result is practical zero-shot classification that works on everything from multilingual support tickets to real-world images, without retraining.
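To make the "labels as language" idea concrete, here is a minimal sketch in Python. It is not the JigsawStack model or API: a simple bag-of-words cosine similarity stands in for the vision language model, and the function names (`bag_of_words`, `cosine`, `classify`) and the example labels are hypothetical. What it illustrates is the interface: labels are free-form descriptive strings supplied at call time, so adding a label needs no retraining.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text: str, labels: list[str]) -> list[tuple[str, float]]:
    """Score every label string against the input and rank them, best first.

    Assumption: a toy lexical scorer replaces the real model here.
    """
    query = bag_of_words(text)
    scored = [(label, cosine(query, bag_of_words(label))) for label in labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical labels: long, descriptive strings rather than fixed class IDs.
labels = [
    "billing problem: the customer was charged twice or wants a refund",
    "login problem: the customer cannot sign in or reset the password",
]
ranking = classify("I was charged twice and I want a refund", labels)
print(ranking[0][0])
```

Because the label list is just an argument, a new category can be tried by appending another string to `labels` on the next call. A lexical scorer like this one cannot handle paraphrases, other languages, or images; that is the gap a language-model-based scorer is meant to close.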
Result
Input image:

Classification:
Result
Input image:

Reference images as labels:

The two reference images above (shown side by side) are passed as input and show two poker hands. The input image shows a hand in which the player holds a “Royal Flush”, the best and rarest hand in poker, consisting of the ace, king, queen, jack, and ten of the same suit. The task is to classify the player's hand based on the reference images, as follows:
Result
Zero-shot multimodal classification isn't magic, nor is it a feat achieved by training on every label in the world; rather, it's language-conditioned decision-making with honest uncertainty. By treating labels as language and leveraging SLMs, we get a classifier that adapts to new labels and generalizes better without a retraining cycle.
Have questions or want to show off what you've built? Join the JigsawStack developer community on Discord and X/Twitter. Let's build something amazing together!
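One way to read "honest uncertainty" is that the classifier should report a probability for each label and decline to answer when no label clearly wins. The sketch below shows that idea in Python under stated assumptions: the raw per-label scores, the `decide` helper, and the 0.6 threshold are all hypothetical and are not taken from the JigsawStack implementation.

```python
from __future__ import annotations

import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw per-label scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def decide(scores: dict[str, float], threshold: float = 0.6):
    """Return (label, probabilities), or (None, probabilities) to abstain.

    Assumption: the threshold is an illustrative value, not a tuned one.
    """
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    if probs[best] < threshold:
        return None, probs  # no label is confident enough
    return best, probs


# Hypothetical scores: a clear winner, then a near tie.
confident, _ = decide({"royal flush": 2.0, "full house": 0.0})
unsure, _ = decide({"royal flush": 0.1, "full house": 0.0})
print(confident, unsure)
```

Returning the full probability map alongside the decision lets the caller log near ties or route them to a human instead of silently picking a label.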