JigsawStack vs Groq vs AssemblyAI vs OpenAI Speech-to-Text Benchmark Comparison


Speech-to-text technology is reshaping how we build and interact with apps, automate processes, and capture insights on the go. For startups and developers, choosing the right solution can significantly impact product outcomes and scaling costs.

This article will break down JigsawStack, Groq, AssemblyAI, and OpenAI across factors like latency (speed), feature depth, language support, cost and more. By the end, you’ll have a clearer view of which provider best aligns with your technical and business goals.

JigsawStack vs Groq vs AssemblyAI vs OpenAI: A quick overview

| Criteria | JigsawStack | Groq | AssemblyAI | OpenAI |
| --- | --- | --- | --- | --- |
| Model | insanely-fast-whisper | whisper-large-v3-turbo | Universal-1 | whisper-2 |
| Latency (5s audio) | 765ms | 631ms | 4s | 12s |
| Latency (3m video) | 2.7s | 3.5s | 7.8s | 10s |
| Latency (30m audio) | 11s | 12s | 29s | 91s |
| Latency (1hr 35m audio) | 27s | Errored out | 42s | Errored out |
| Word Error Rate (WER) | 10.3% | 12% | 8.7% | 10.6% |
| Diarization Support | Yes | No | Yes | No |
| Timestamp | Sentence level | Sentence level | Word level | Sentence level |
| Large File | Up to 100 MB | Up to 25 MB | Up to 5 GB | Up to 25 MB |
| Automatic Language Detection | Yes | Yes | Yes | Yes |
| Streaming Support | No | No | Yes | No |
| Pricing | $0.05/hr | $0.04/hr | $0.37/hr | $0.36/hr |
| Best For | Speed, low cost, production apps | Low cost, lightweight apps | Real-time transcription apps | - |

Note: Tests were conducted using each provider’s SDK on a controlled dataset. For transparency, the implementation code is available here.

Latency

Performance was evaluated over 10 iterations on each of 4 sample files:

  1. A 5 second audio file

  2. A 3 minute video file

  3. A 30 minute audio file

  4. A 1 hour 35 minute audio file
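The measurement loop for tests like these can be sketched with a small timing harness. This is a minimal illustration, not the benchmark repo's actual code; `transcribe` stands in for whichever provider SDK call is being measured:

```python
import time
from statistics import mean

def benchmark(transcribe, audio_path, iterations=10):
    """Time a transcription callable over several runs and return the mean latency in seconds."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        transcribe(audio_path)  # a provider SDK call would go here
        timings.append(time.perf_counter() - start)
    return mean(timings)

# Example with a stand-in function; a real run would wrap a provider's SDK.
avg = benchmark(lambda path: None, "sample-5s.mp3", iterations=3)
print(f"average latency: {avg * 1000:.2f} ms")
```

Averaging over multiple iterations smooths out network jitter, which matters when the fastest providers differ by only ~100 ms.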

Results for 5 second Audio File (5 KB)

  • JigsawStack: Processed in 765 milliseconds

  • Groq: Processed in 631 milliseconds

  • AssemblyAI: Processed in 4 seconds

  • OpenAI: Processed in 12 seconds

Results for 3 minute Video File (4.48 MB)

  • JigsawStack: Processed in 2.7 seconds

  • Groq: Processed in 3.5 seconds

  • AssemblyAI: Processed in 7.8 seconds

  • OpenAI: Processed in 10 seconds

Results for 30 minute Audio File (16 MB)

  • JigsawStack: Processed in 11 seconds

  • Groq: Processed in 12 seconds

  • AssemblyAI: Processed in 29 seconds

  • OpenAI: Processed in 91 seconds

Results for 1 hour 35 minute Audio File (34.4 MB)

  • JigsawStack: Processed in 27 seconds

  • AssemblyAI: Processed in 42 seconds

  • Groq: Errored out (file too large)

  • OpenAI: Errored out (file too large)

Key Takeaways

JigsawStack performed consistently well across both audio and video formats, balancing speed on short and long files. On average it was nearly twice as fast as AssemblyAI and faster than Groq in 3 out of 4 tests, while Groq struggled with larger files. Notably, OpenAI was the slowest overall.

For shorter audio (less than ~10 seconds), JigsawStack and Groq were nearly neck and neck, with Groq about 100 ms faster overall, demonstrating exceptional efficiency for time-sensitive transcription needs. JigsawStack's reliability and speed across varied file sizes reinforce its suitability as a top choice for developers and startups prioritizing rapid processing without compromising accuracy.

Word Error Rate

Precision is essential in voice-driven AI applications, and Word Error Rate (WER) is the primary metric for evaluating speech-to-text accuracy. A low WER translates to fewer transcription errors, resulting in reliable data that powers applications like summaries, customer insights, metadata tagging, and action item identification. Below is a WER comparison across the models:

| Provider | Model | WER (%) |
| --- | --- | --- |
| AssemblyAI | Universal-1 | 8.7 |
| JigsawStack | insanely-fast-whisper | 10.3 |
| OpenAI | whisper-2 | 10.6 |
| Groq | whisper-large-v3-turbo | 12.0 |

Finding the right balance between a low WER, high performance and low cost is a good way to decide which model makes sense for your application.
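For reference, WER is simply the word-level edit distance between the reference transcript and the model's output, divided by the reference word count. A minimal self-contained implementation (production evaluations usually also normalize casing and punctuation first, which this sketch skips):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# → 0.16666666666666666 (1 deletion / 6 reference words)
```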

Speaker Diarization

Speaker diarization distinguishes and tracks individual speakers within an audio file, providing clarity on who said what.

At present:

  • JigsawStack and AssemblyAI support speaker diarization, enabling precise speaker attribution.

  • Groq and OpenAI do not currently offer this feature.

Timestamp Features

  • JigsawStack: Offers sentence-level timestamps.

  • AssemblyAI: Provides word-level timestamps.

  • Groq: Supports sentence-level timestamps.

  • OpenAI: Supports sentence-level timestamps.

In most cases, sentence-level timestamps will fit your use case. However, if you need word-level timestamps, AssemblyAI is a good fit, at the cost of speed.

An alternative is to convert sentence-level timestamps to word level by distributing each sentence's duration across its words. This reduces timestamp accuracy but keeps the faster model's performance.
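That conversion can be sketched in a few lines: give each word a slice of the sentence's time span proportional to its character length. This is an estimate, not a real alignment, so expect drift on sentences with long pauses:

```python
def sentence_to_word_timestamps(sentence, start, end):
    """Estimate word-level timestamps from a sentence-level span.
    Each word gets a slice of the span proportional to its character length."""
    words = sentence.split()
    total_chars = sum(len(w) for w in words)
    duration = end - start
    out, cursor = [], start
    for w in words:
        w_dur = duration * len(w) / total_chars
        out.append({"word": w, "start": round(cursor, 3), "end": round(cursor + w_dur, 3)})
        cursor += w_dur
    return out

for item in sentence_to_word_timestamps("hello there world", 0.0, 1.6):
    print(item)
```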

Large File Support

  • AssemblyAI: Supports audio files up to 5 GB.

  • JigsawStack: Handles audio and video files up to 100 MB.

  • Groq: Supports audio files up to 25 MB.

  • OpenAI: Supports audio files up to 25 MB.

For applications requiring support for extremely large files, AssemblyAI offers the best compatibility with its 5 GB file size limit, though at a significant cost in speed.

For large files, a practical approach is to chunk the file into smaller segments with a library like PyDub, transcribe the chunks concurrently with a faster model, and then merge the outputs.
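The chunking step boils down to computing boundaries, then slicing (with PyDub that would be `audio[start:end]`) and sending each piece to the API. A sketch of the boundary computation; the 10-minute chunk size and 2-second overlap are assumptions chosen so words straddling a cut appear in both chunks:

```python
def chunk_boundaries(total_ms, chunk_ms=10 * 60 * 1000, overlap_ms=2000):
    """Return (start, end) millisecond pairs covering the file, with a small
    overlap so words straddling a boundary are not lost at chunk edges."""
    bounds, start = [], 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        bounds.append((start, end))
        if end == total_ms:
            break
        start = end - overlap_ms  # back up slightly for the next chunk
    return bounds

# A 95-minute file split into 10-minute chunks with 2s overlap:
print(chunk_boundaries(95 * 60 * 1000))
```

Merging the transcripts then requires de-duplicating words in the overlap regions, which the overlap timestamps make straightforward.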

Automatic Language Detection

All providers offer automatic language detection, allowing for seamless transcription across multiple languages. This capability makes them well-suited for global applications with diverse linguistic content.

Note: We saw a slight performance improvement across all models when the language is predefined in the requests.

Streaming Support

AssemblyAI provides real-time transcription via WebSocket streaming, making it ideal for applications requiring immediate feedback.

Currently, JigsawStack, OpenAI and Groq do not offer streaming capabilities, which may limit their suitability for real-time use cases.

Groq and JigsawStack return results in under a second for short audio, which can serve as an alternative for near-real-time applications when a WebSocket connection is not a requirement.

Pricing

Each provider offers free access for experimentation, making it easier for startups to prototype. Below is a breakdown of their paid pricing options:

| Provider | Model | Price | Rate limits |
| --- | --- | --- | --- |
| Groq | whisper-large-v3-turbo | $0.04/hr | High rate limits |
| JigsawStack | insanely-fast-whisper | $0.05/hr | - |
| OpenAI | whisper-2 | $0.36/hr | Tier-based rate limits |
| AssemblyAI | Universal-1 | $0.37/hr | - |

JigsawStack offers the best quality-to-price ratio, ideal for both startups and enterprises looking to scale without worry.
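At these rates, projecting monthly spend is straightforward. A quick sketch using the hourly prices from the table above (illustrative only; real bills depend on each provider's billing granularity and free tiers):

```python
# Hourly transcription prices from the comparison table (USD per audio hour).
PRICES = {
    "Groq": 0.04,
    "JigsawStack": 0.05,
    "OpenAI": 0.36,
    "AssemblyAI": 0.37,
}

def monthly_cost(hours_per_month: float) -> dict:
    """Estimated monthly spend per provider for a given audio volume."""
    return {name: round(rate * hours_per_month, 2) for name, rate in PRICES.items()}

# e.g. 1,000 hours of audio a month:
print(monthly_cost(1000))
# → {'Groq': 40.0, 'JigsawStack': 50.0, 'OpenAI': 360.0, 'AssemblyAI': 370.0}
```

At scale the gap is stark: 1,000 hours a month costs roughly $40-$50 on Groq or JigsawStack versus ~$360-$370 on OpenAI or AssemblyAI.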

Benchmark environment

We ran the benchmarks in CodeSandbox to provide a consistent environment for each run, with a consistent internet connection and speed. You can find the GitHub code repo here.

Conclusion

Choosing the right speech-to-text (STT) provider hinges on aligning your specific needs with the strengths of each platform. Here's a breakdown to help guide your decision:

  • JigsawStack strikes the right balance between speed, cost-efficiency, and versatility. With industry-leading processing times, robust accuracy, and highly competitive pricing ($0.05/hr), it stands out as the most balanced option for scalable transcription solutions.

  • AssemblyAI shines with its advanced streaming support for real-time use cases like live transcription and interactive voice applications. Its word-level timestamps and exceptional large-file support (up to 5 GB) also make it a strong contender for developers seeking precision and high capacity, albeit at a significantly higher cost ($0.37/hr) and slower performance.

  • Groq offers the most affordable pricing ($0.04/hr) but falls short in performance when handling larger files and lacks features like speaker diarization. This makes it better suited for budget-conscious, non-production use cases with smaller files and simpler requirements.

  • OpenAI performed worst across all metrics, struggling with both large and small files, which makes it less adaptable for varied or high-demand workloads. Its high price ($0.36/hr) is also less competitive given these limitations. OpenAI is still running the older Whisper-2 model rather than the newer Whisper-large-v3 (Turbo).

For most startups and developers, JigsawStack delivers the best blend of speed, affordability, and features, making it the top pick for cost-effective, high-value transcription needs. If real-time streaming is critical, AssemblyAI emerges as the superior choice. Ultimately, understanding your technical needs—such as file size, real-time demands, or budget constraints—will be key in making the right selection.

Extra

Pair your Speech-to-Text (STT) integration with a Text-to-Speech (TTS) model, allowing your apps to speak in 80+ languages, styles, and accents with JigsawStack's recently launched TTS model.

👥 Join the JigsawStack Community

Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!