26 Nov 2024 • 7 min read
Speech-to-text technology is reshaping how we build and interact with apps, automate processes, and capture insights on the go. For startups and developers, choosing the right solution can significantly impact product outcomes and scaling costs.
This article will break down JigsawStack, Groq, AssemblyAI, and OpenAI across factors like latency (speed), feature depth, language support, cost and more. By the end, you’ll have a clearer view of which provider best aligns with your technical and business goals.
| Criteria | JigsawStack | Groq | AssemblyAI | OpenAI |
|---|---|---|---|---|
| Model | Insanely-fast-whisper | Whisper-large-v3-turbo | Universal-1 | Whisper-2 |
| Latency (5s audio) | 765ms | 631ms | 4s | 12s |
| Latency (3m video) | 2.7s | 3.5s | 7.8s | 10s |
| Latency (30m video) | 11s | 12s | 29s | 91s |
| Latency (1hr 35m video) | 27s | Errored out | 42s | Errored out |
| Word Error Rate (WER) | 10.3% | 12% | 8.7% | 10.6% |
| Diarization Support | Yes | No | Yes | No |
| Timestamps | Sentence level | Sentence level | Word level | Sentence level |
| Max File Size | Up to 100MB | Up to 25MB | Up to 5GB | Up to 25MB |
| Automatic Language Detection | Yes | Yes | Yes | Yes |
| Streaming Support | No | No | Yes | No |
| Pricing | $0.05/hr | $0.04/hr | $0.37/hr | $0.36/hr |
| Best For | Speed, low cost, production apps | Low-cost, lightweight apps | Real-time transcription apps | - |
Note: Tests were conducted using each provider's SDK on a controlled dataset. For transparency, the implementation code is available here.
Performance was evaluated over 10 iterations on each of 4 sample files:

- A 5-second audio file
- A 3-minute video file
- A 30-minute audio file
- A 1-hour-35-minute audio file
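The per-provider timings below can be reproduced with a simple harness that averages wall-clock time over repeated runs. A minimal sketch (not the actual benchmark code, which lives in the linked repo): `transcribe` stands in for any provider SDK call wrapped in a function that takes a file path.

```python
import statistics
import time

def benchmark(transcribe, file_path, iterations=10):
    """Time a transcription callable over several runs and return the mean
    latency in seconds. `transcribe` is a placeholder for a provider SDK call."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        transcribe(file_path)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

# Example with a stand-in "provider" that just sleeps briefly:
mean_s = benchmark(lambda path: time.sleep(0.01), "sample_5s.mp3")
```

Using `time.perf_counter()` rather than `time.time()` avoids clock-adjustment artifacts when measuring short durations.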
**5-second audio file**

- JigsawStack: Processed in 765 milliseconds
- Groq: Processed in 631 milliseconds
- AssemblyAI: Processed in 4 seconds
- OpenAI: Processed in 12 seconds
**3-minute video file**

- JigsawStack: Processed in 2.7 seconds
- Groq: Processed in 3.5 seconds
- AssemblyAI: Processed in 7.8 seconds
- OpenAI: Processed in 10 seconds
**30-minute audio file**

- JigsawStack: Processed in 11 seconds
- Groq: Processed in 12 seconds
- AssemblyAI: Processed in 29 seconds
- OpenAI: Processed in 91 seconds
**1-hour-35-minute audio file**

- JigsawStack: Processed in 27 seconds
- AssemblyAI: Processed in 42 seconds
- Groq: Errored out (file too large)
- OpenAI: Errored out (file too large)
JigsawStack consistently performed well across both audio and video formats, striking a good balance between short and long files. On average it was nearly twice as fast as AssemblyAI, and it beat Groq in 3 out of 4 tests; Groq struggled with larger files. Notably, OpenAI showed the slowest performance overall.
For shorter audio (less than ~10 seconds), JigsawStack and Groq were nearly tied, with Groq ~100ms faster overall, demonstrating exceptional efficiency for time-sensitive transcription needs. JigsawStack's reliability and speed across varied file sizes reinforce its suitability as a top choice for developers and startups prioritizing rapid processing without compromising accuracy.
Precision is essential in voice-driven AI applications, and Word Error Rate (WER) is the primary metric for evaluating speech-to-text accuracy. A low WER translates to fewer transcription errors, resulting in reliable data that powers applications like summaries, customer insights, metadata tagging, and action item identification. Below is a WER comparison across the models:
| Provider | Model | WER (%) |
|---|---|---|
| AssemblyAI | Universal-1 | 8.7 |
| JigsawStack | insanely-fast-whisper | 10.3 |
| OpenAI | whisper-2 | 10.6 |
| Groq | whisper-large-v3-turbo | 12 |
Finding the right balance between a low WER, high performance, and low cost is a good way to decide which model makes sense for your application.
Speaker diarization distinguishes and tracks individual speakers within an audio file, providing clarity on who said what.
At present:
- JigsawStack and AssemblyAI support speaker diarization, enabling precise speaker attribution.
- Groq and OpenAI do not currently offer this feature.
- JigsawStack: Offers sentence-level timestamps.
- AssemblyAI: Provides word-level timestamps.
- Groq: Supports sentence-level timestamps.
- OpenAI: Supports sentence-level timestamps.
In most cases, sentence-level timestamps will fit your use case. However, if you need word-level timestamps, AssemblyAI is a good fit, at the cost of performance and speed.
An alternative is to approximate word-level timestamps from sentence-level ones by splitting the sentence's time span across its words. This reduces timestamp accuracy but keeps the performance of the faster model.
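That approximation can be sketched in a few lines. This is a heuristic of our own, not a provider API: it distributes a sentence's duration across its words in proportion to word length, which assumes roughly even speaking pace.

```python
def sentence_to_word_timestamps(text: str, start: float, end: float):
    """Approximate word-level timestamps from a sentence-level span by
    allocating the duration to each word in proportion to its length.
    A rough heuristic -- accuracy degrades with pauses or uneven pacing."""
    words = text.split()
    total_chars = sum(len(w) for w in words) or 1
    duration = end - start
    cursor = start
    spans = []
    for w in words:
        share = duration * len(w) / total_chars
        spans.append({"word": w,
                      "start": round(cursor, 3),
                      "end": round(cursor + share, 3)})
        cursor += share
    return spans

# One sentence-level span (0.0s - 3.2s) split into per-word estimates
spans = sentence_to_word_timestamps("hello world again", 0.0, 3.2)
```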
- AssemblyAI: Supports audio files up to 5 GB
- JigsawStack: Handles audio and video files up to 100 MB
- Groq: Supports audio files up to 25 MB
- OpenAI: Supports audio files up to 25 MB
For applications requiring support for extremely large files, AssemblyAI offers the best compatibility with its 5 GB file size limit, though at a significant cost in performance.
For large files, the suggested approach would be to chunk the file into smaller bits using a library like PyDub, then process them concurrently with a faster model and combine the output.
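A sketch of that chunk-and-stitch approach: compute fixed-size windows over the file's duration, transcribe them concurrently, and join the results in order. The transcriber below is a hypothetical stand-in; with PyDub you would load the file as an `AudioSegment`, slice it as `segment[start_ms:end_ms]`, and export each piece before uploading.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_spans(duration_ms: int, chunk_ms: int = 10 * 60 * 1000):
    """Split a file's duration into (start_ms, end_ms) windows.
    Each window maps to one audio slice to transcribe independently."""
    return [(start, min(start + chunk_ms, duration_ms))
            for start in range(0, duration_ms, chunk_ms)]

def transcribe_chunks(transcribe, spans, max_workers=4):
    """Run a per-chunk transcription callable concurrently; pool.map
    preserves input order, so the joined transcript reads correctly."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(transcribe, spans))
    return " ".join(parts)

# A 95-minute file becomes ten 10-minute chunks (hypothetical transcriber)
spans = chunk_spans(95 * 60 * 1000)
text = transcribe_chunks(lambda span: f"[{span[0]}-{span[1]}]", spans)
```

Cutting on silence boundaries (PyDub's `silence` helpers) rather than fixed windows avoids splitting words mid-chunk, at the cost of more preprocessing.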
All providers offer automatic language detection, allowing for seamless transcription across multiple languages. This capability makes them well-suited for global applications with diverse linguistic content.
Note: We saw a slight performance improvement across all models when the language is predefined in the requests.
AssemblyAI provides real-time transcription via WebSocket streaming, making it ideal for applications requiring immediate feedback.
Currently, JigsawStack, OpenAI and Groq do not offer streaming capabilities, which may limit their suitability for real-time use cases.
Groq and JigsawStack have sub-second response times for short audio, which could serve as an alternative for near-real-time applications when WebSocket connections are not a requirement.
Each provider offers free access for experimentation, making it easier for startups to prototype. Below is a breakdown of their paid pricing options:
| Provider | Model | Price | Rate limits |
|---|---|---|---|
| Groq | whisper-large-v3-turbo | $0.04/hr | High rate limits |
| JigsawStack | insanely-fast-whisper | $0.05/hr | - |
| OpenAI | whisper-2 | $0.36/hr | Tier-based rate limits |
| AssemblyAI | Universal-1 | $0.37/hr | - |
JigsawStack offers the best quality-to-price ratio, ideal for both startups and enterprises looking to scale without worry.
We ran the benchmarks in CodeSandbox to provide a consistent environment for each run, with a consistent internet connection and speed. You can find the GitHub code repo here.
Choosing the right speech-to-text (STT) provider hinges on aligning your specific needs with the strengths of each platform. Here's a breakdown to help guide your decision:
JigsawStack strikes the right balance between speed, cost-efficiency, and versatility. With industry-leading processing times, robust accuracy, and highly competitive pricing ($0.05/hr), it stands out as the most balanced option for scalable transcription solutions.
AssemblyAI shines with its advanced streaming support for real-time use cases like live transcription and interactive voice applications. Its word-level timestamps and exceptional large-file support (up to 5 GB) also make it a strong contender for developers seeking precision and high capacity, albeit at a significantly higher cost ($0.37/hr) and slower performance.
Groq offers the most affordable pricing ($0.04/hr) but falls short in performance when handling larger files, lacks features like speaker diarization, and imposes high rate limits. This makes it better suited for budget-conscious, non-production use cases with smaller files and simpler requirements.
OpenAI performed worst across all metrics, struggling with both large and small files, which makes it less adaptable for varied or high-demand workloads. Its high price ($0.36/hr) is less competitive given these limitations. OpenAI is also running an older model, Whisper large-v2, rather than the latest Whisper large-v3 (Turbo).
For most startups and developers, JigsawStack delivers the best blend of speed, affordability, and features, making it the top pick for cost-effective, high-value transcription needs. If real-time streaming is critical, AssemblyAI emerges as the superior choice. Ultimately, understanding your technical needs—such as file size, real-time demands, or budget constraints—will be key in making the right selection.
Pair your Speech-to-Text (STT) integration with a Text-to-Speech (TTS) model, allowing your apps to speak in 80+ different languages, styles, and accents with JigsawStack's recently launched TTS model.
Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!