21 Mar 2025
The past few weeks have been awesome in the AI audio space, from a new ASR model from ElevenLabs to the open-source launch of Nvidia’s 1B Canary model.
OpenAI’s open-source Whisper 3 large/turbo has been the gold standard for transcription, and in our eyes it remains pretty much undefeated when you compare the full feature set: language coverage, word-level timestamps, and the ability to tune the model to add more languages and improve performance!
Many optimizations exist, like faster-whisper and our own Insanely Fast Whisper API, which blows the base Whisper 3 & 2 served through OpenAI’s API out of the park.
OpenAI recently launched a closed-source model, gpt-4o-transcribe, that claims to be the “next generation” of audio models, outperforming the best in the market based on benchmarks against Whisper 3, Gemini 2, and Nova.
What is Speech-to-Text / ASR / Transcription? It’s a type of model that takes in an audio file of one or more speakers and transcribes it into text. Some basic features all good transcription models should have:
Timestamps (either word-level or sentence-level)
Speaker recognition
Large files & long audio support
Blazing fast speeds for short audio (Real-time use case)
Multilingual support & Translation
High accuracy (low word error rate, or WER)
We’ll be comparing OpenAI gpt-4o-transcribe against JigsawStack Speech-to-text in a series of real-world examples.
| | OpenAI gpt-4o-transcribe | JigsawStack Speech-to-text |
|---|---|---|
| ⏳ Timestamps | No support found in their docs. ❌ | Sentence-level timestamps by default; speaker-level timestamps available. ✅ |
| 💬 Speaker recognition | No support found in their docs. ❌ | Speaker recognition via diarization, with timestamps per speaker. Up to 50 speakers supported. ✅ |
| ⛰️ Large file & long audio support | Up to 25 MB and 25 mins of audio per request. ❌ | Up to 100 MB and 4 hours of audio per request. ✅ |
| ⚡ Performance | Almost ~2.4x slower than JigsawStack. ❌ | ~2.4x faster transcription across all audio lengths while returning more data, like timestamps. ✅ |
| 🌍 Multilingual support | Great support for transcribing the most popular languages. ✅ | Close second to OpenAI for native multilingual transcription. ◐✅ |
| 🔄 Translation | No support found in their docs. ❌ | Built-in translation support for 100+ languages. ✅ |
| 📝 Accuracy | No published WER data; 93% accurate transcription on our real-world test. ❌ | 97% accurate transcription on our real-world test. ✅ |
| 👯‍♀️ Team size | ~5,300 people at OpenAI | 3 people at JigsawStack |
Here’s the simple JS/TS script we’re gonna use to test everything:
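Below is a minimal sketch of that script, assuming the official `openai` and `jigsawstack` Node SDKs; the JigsawStack method and parameter names are based on its SDK docs and should be treated as assumptions:

```ts
import fs from "node:fs";
import OpenAI from "openai";
import { JigsawStack } from "jigsawstack";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const jigsaw = JigsawStack({ apiKey: process.env.JIGSAWSTACK_API_KEY });

// Hypothetical test inputs: a local copy for OpenAI (file upload)
// and a public URL for JigsawStack.
const audioPath = "./sample.mp3";
const audioUrl = "https://example.com/sample.mp3";

// OpenAI: upload the audio and time the round trip
console.time("openai");
const openaiResult = await openai.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: fs.createReadStream(audioPath),
});
console.timeEnd("openai");
console.log(openaiResult.text);

// JigsawStack: point the API at the audio URL and time the round trip
console.time("jigsawstack");
const jigsawResult = await jigsaw.audio.speech_to_text({ url: audioUrl });
console.timeEnd("jigsawstack");
console.log(jigsawResult);
```

The same two calls are reused in every test below; only the parameters change.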
OpenAI’s Transcription
JigsawStack’s Transcription
Both gave 100% accurate results; however, OpenAI currently doesn’t support timestamps.
OpenAI’s Transcription
JigsawStack’s Transcription
JigsawStack managed to accurately break down each sentence with timestamps while OpenAI provided a large chunk of text with no breakdown.
OpenAI’s Transcription
JigsawStack’s Transcription
We’ll have to update the above test script to pass by_speaker: true to the JigsawStack SDK to generate speaker tags. It should look something like this:
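A sketch of the updated JigsawStack call (again, the parameter name is taken from its SDK docs as an assumption):

```ts
// JigsawStack: enable speaker diarization alongside sentence timestamps
const jigsawResult = await jigsaw.audio.speech_to_text({
  url: audioUrl,
  by_speaker: true, // tags each transcribed segment with a speaker label
});
console.log(jigsawResult); // segments now carry speaker tags and timestamps
```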
Result
OpenAI’s latest transcription model doesn’t support speaker recognition and tagging, which makes up a huge chunk of audio transcription use cases. JigsawStack easily transcribed the above audio both by timestamp and by speaker tags in under 15 seconds.
We’re pushing a huge update in the coming week that will improve speaker recognition performance by 10x for longer audio while also increasing quality.
OpenAI’s Transcription
JigsawStack’s Transcription
Too large for this blog, view full output here: https://gist.github.com/yoeven/a2c1c6eae388e34c350ffe1f407f8af5
OpenAI doesn’t support files above 25 MB or audio longer than 25 minutes. JigsawStack transcribed a 32 MB file that’s over 1 hr 35 mins long in under 30 seconds, with high accuracy and a sentence-by-sentence timestamp breakdown.
Same audio file we used at the start for the timestamp example.
OpenAI and JigsawStack Results (Lower is better)
We did 3 runs on the same audio file. On the first run, OpenAI was close to JigsawStack’s performance, but still slower by ~300ms. By the second and third runs, OpenAI’s model slowed down significantly, taking almost double the time.
Having a great model is not enough; having the right infrastructure to run it at scale is equally important. This is why we train our models specifically for the infra we run them on.
OpenAI and JigsawStack Results (Lower is better)
OpenAI’s Transcription
JigsawStack’s Transcription
Once again we see JigsawStack outperforming OpenAI’s latest & best speech-to-text model with double the performance. You can also see that OpenAI’s model wrongly transcribed the audio with only one “oh no” when there were two.
Same audio file we used at the start for the timestamp example.
OpenAI and JigsawStack Results (Lower is better)
JigsawStack outperforms OpenAI even on longer audio, with 2.4x faster performance while being more accurate!
We’ll add a language config to both OpenAI’s and JigsawStack’s SDKs to let them know the audio is in Hindi. It should look something like this:
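A sketch of the change, using the ISO-639-1 code for Hindi (the JigsawStack parameter name is an assumption based on its SDK docs):

```ts
// OpenAI: hint the input language with an ISO-639-1 code
const openaiResult = await openai.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: fs.createReadStream(audioPath),
  language: "hi", // Hindi
});

// JigsawStack: the same hint (parameter name assumed)
const jigsawResult = await jigsaw.audio.speech_to_text({
  url: audioUrl,
  language: "hi",
});
```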
OpenAI’s Transcription
JigsawStack’s Transcription
OpenAI had higher accuracy for native Hindi transcription compared to JigsawStack, which was a close second.
We’ll use the same Hindi audio sample.
We’ll add a language config to both OpenAI’s and JigsawStack’s SDKs to tell them to translate the audio to English. It should look something like this:
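A sketch of what that looks like: gpt-4o-transcribe exposes no translation flag, so the closest we can get is pointing its language config at English, while JigsawStack’s `translate` option (parameter name assumed from its SDK docs) handles it natively:

```ts
// OpenAI: no translation option, so we set the language config to English
// (per our tests, this has no effect on gpt-4o-transcribe)
const openaiResult = await openai.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: fs.createReadStream(audioPath),
  language: "en",
});

// JigsawStack: built-in translation of the transcript into English
// (parameter name assumed)
const jigsawResult = await jigsaw.audio.speech_to_text({
  url: audioUrl,
  translate: true,
});
```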
OpenAI’s Transcription
JigsawStack’s Transcription
OpenAI’s latest model has no support for translation or prompting, which prevents the translation setting from working. JigsawStack, on the other hand, natively supports translation into most languages. You can find all supported languages in our docs here.
OpenAI’s Result
JigsawStack’s Result
While OpenAI claims a lower WER than the rest, in this noisy real-world example JigsawStack got a 97% accurate transcription, while OpenAI’s output had missing and inaccurate words, making it only 93% accurate.
JigsawStack Text (97% accurate)
You are wearing the target speech steering system. You can click the button on the right and then look at me for a few seconds. Now when you are looking at me, the system will register my voice and enroll it. Now you can take a walk. So now the system has an enrollment of my voice, it can extract*, it can focus on only my voice while ignoring all the interfering sounds in the environment. So we introduced this system called click on CD, where* we*,* like suppose we're in a scenario like this where you're trying to hear my voice when someone else is trying to speak, then You can just look at my, you can look at me for a few seconds, get some noisy example of my voice, and then you can separate out or filter out my voice only from everybody else's.
OpenAI’s text (93% accurate)
Hey, you're wearing the target speech hearing system. You can click the button on the right and then look at me for a few seconds. Now when you're looking at me, the system will register my voice and enroll it. Now you can take a walk. So now the system has an enrollment of my voice, it can focus on only my voice while ignoring all the interfering sounds in the environment. So we introduced this system called Sequence in Use, where suppose we're in a scenario like this, where you're trying to hear my voice when someone else is trying to speak, then you can just look at me for a few seconds, get some noisy example of my voice, and then you can separate out or filter out my voice only from everyone else's.
“it can extract” is missing from OpenAI’s text
Guessed the wrong name “Sequence in Use” when it’s supposed to be “Look Once to Hear”
missing “we” and “like”
We’re huge fans of the work being done at OpenAI, especially the open-source work! This article wouldn’t be possible without Whisper 3, the base model we optimized and trained to get the output you see here!
In all the above tests we used JigsawStack’s transcription model and OpenAI’s latest transcription model. You can run all tests yourself using these APIs and the script provided at the start of this article.
Overall, JigsawStack outperforms OpenAI’s best-in-class speech-to-text model in performance, accuracy, features, and more!
| | OpenAI gpt-4o-transcribe | JigsawStack Speech-to-text |
|---|---|---|
| ⏳ Timestamps | No support found in their docs. ❌ | Sentence-level timestamps by default; speaker-level timestamps available. ✅ |
| 💬 Speaker recognition | No support found in their docs. ❌ | Speaker recognition via diarization, with timestamps per speaker. Up to 50 speakers supported. ✅ |
| ⛰️ Large file & long audio support | Up to 25 MB and 25 mins of audio per request. ❌ | Up to 100 MB and 4 hours of audio per request. ✅ |
| ⚡ Performance | Almost ~2.4x slower than JigsawStack. ❌ | ~2.4x faster transcription across all audio lengths while returning more data, like timestamps. ✅ |
| 🌍 Multilingual support | Great support for transcribing the most popular languages. ✅ | Close second to OpenAI for native multilingual transcription. ◐✅ |
| 🔄 Translation | No support found in their docs. ❌ | Built-in translation support for 100+ languages. ✅ |
| 📝 Accuracy | No published WER data; 93% accurate transcription on our real-world test. ❌ | 97% accurate transcription on our real-world test. ✅ |
| 👯‍♀️ Team size | ~5,300 people at OpenAI | 3 people at JigsawStack |
Our goal at JigsawStack is to build small, focused models that specialize in doing one thing very well while making them accessible with great DX!
TypeScript/JavaScript example
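A minimal getting-started sketch with the JS SDK (the audio URL is a placeholder; parameter names are assumptions based on the SDK docs):

```ts
import { JigsawStack } from "jigsawstack";

const jigsaw = JigsawStack({ apiKey: process.env.JIGSAWSTACK_API_KEY });

const result = await jigsaw.audio.speech_to_text({
  url: "https://example.com/meeting.mp3", // placeholder audio URL
  by_speaker: true, // optional: speaker diarization with timestamps
});

console.log(result);
```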
Python
We want to do a deep dive into TTS, so we’re writing a separate article breaking down the comparison. You can try it out here in the meantime. Here’s a small sneak peek of how it sounds 👇
Join us on Discord or follow us on X to keep up to date with our future launches and the TTS comparison.