Speech to Text
Transcribe video and audio files with ease, leveraging the Whisper large V3 AI model.
Speech to Text processing is billed at 1 invocation per 25 seconds of processing time. This refers to computational time, not audio length: a 1-hour audio file might take only 20 seconds to process, costing just 1 invocation.
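Since billing rounds up to whole 25-second units of compute, the cost of a job is easy to estimate. A minimal sketch of the billing arithmetic (the 25-second unit comes from the note above; the helper name is ours):

```python
import math

BILLING_UNIT_SECONDS = 25  # 1 invocation per 25 seconds of processing time

def invocations_for(processing_seconds: float) -> int:
    """Return the number of invocations billed for a job's compute time."""
    # Any partial unit still counts as one invocation, so round up.
    return max(1, math.ceil(processing_seconds / BILLING_UNIT_SECONDS))

# A 1-hour file that takes 20 s of compute costs 1 invocation;
# a job that takes 60 s of compute costs 3.
```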
Overview
Transform audio and video files into accurate text transcriptions using our powerful Speech to Text API. Built on Whisper large V3, this service provides:
- High-accuracy transcription across multiple languages
- Speaker identification for multi-person conversations
- Translation capabilities for global content
- Asynchronous processing via webhooks
Supported Formats and Limitations
- Supported formats: MP3, WAV, M4A, FLAC, AAC, OGG, WEBM
- Maximum file size: 500MB
- Maximum audio duration: 4 hours
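Checking these limits client-side before uploading avoids wasted transfers. A minimal pre-flight validation sketch based on the limits listed above (the function name and error messages are ours):

```python
SUPPORTED_FORMATS = {"mp3", "wav", "m4a", "flac", "aac", "ogg", "webm"}
MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024   # 500MB
MAX_DURATION_SECONDS = 4 * 60 * 60        # 4 hours

def validate_upload(filename: str, size_bytes: int, duration_seconds: float) -> list[str]:
    """Return a list of limit violations; an empty list means the file looks acceptable."""
    errors = []
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in SUPPORTED_FORMATS:
        errors.append(f"unsupported format: {ext or 'none'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        errors.append("file exceeds 500MB limit")
    if duration_seconds > MAX_DURATION_SECONDS:
        errors.append("audio exceeds 4 hour limit")
    return errors
```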
Request Parameters
Body
- `url` — The video/audio URL. Not required if `file_store_key` is specified.
- `file_store_key` — The key used to store the video/audio file on JigsawStack File Storage. Not required if `url` is specified.

Either `url` or `file_store_key` should be provided, not both.

- `language` — The language to transcribe or translate the file into. If not specified, the model will automatically detect the language and transcribe accordingly. All supported language codes can be found here.
- `translate` — When set to true, translates the content into English (or into the specified language if the `language` parameter is provided). All supported language codes can be found here.
- `by_speaker` — Identifies and separates different speakers in the audio file. When enabled, the response will include a `speakers` array with speaker-segmented transcripts.
- `webhook_url` — Webhook URL to send the result to. When provided, the API will process asynchronously and send results to this URL when completed.
- `batch_size` — The batch size to return. Maximum value is 40. This controls how the audio is chunked for processing.
Header
- `x-api-key` — Your JigsawStack API key
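Putting the body and header together, a request can be sketched with the standard library alone. The endpoint path below is an illustrative assumption (check your JigsawStack dashboard for the exact URL); the parameter names follow the list above:

```python
import json
import urllib.request

# Endpoint path is an assumption for illustration — confirm against the
# official JigsawStack docs before use.
API_URL = "https://api.jigsawstack.com/v1/ai/transcribe"

def build_transcribe_request(api_key: str, **params) -> urllib.request.Request:
    """Build the POST request; exactly one of url/file_store_key must be set."""
    if ("url" in params) == ("file_store_key" in params):
        raise ValueError("provide exactly one of 'url' or 'file_store_key'")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(params).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# req = build_transcribe_request("your-api-key",
#                                url="https://example.com/talk.mp3",
#                                by_speaker=True, language="en")
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```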
Response Structure
Direct Response
- `success` — Indicates whether the call was successful.
- `text` — The complete transcribed text from the audio/video file.
- `chunks` — An array of transcript chunks with timestamps.
- `speakers` — Only present when `by_speaker` is set to true. Contains speaker-segmented transcripts.
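A small sketch of consuming a direct response. The top-level field names follow the structure described above; the inner chunk and speaker fields (`timestamp`, `speaker`, `text` per segment) are illustrative assumptions:

```python
def summarize_transcription(resp: dict) -> str:
    """Flatten a transcription response into readable lines.

    Top-level field names follow the documented response structure;
    per-segment fields are assumptions for illustration.
    """
    if not resp.get("success"):
        return "transcription failed"
    lines = [resp.get("text", "")]
    for seg in resp.get("speakers", []):  # only present when by_speaker=true
        lines.append(f"{seg.get('speaker')}: {seg.get('text')}")
    return "\n".join(lines)

sample = {
    "success": True,
    "text": "Hello there. Hi!",
    "chunks": [{"text": "Hello there.", "timestamp": [0.0, 1.2]}],
    "speakers": [
        {"speaker": "SPEAKER_1", "text": "Hello there."},
        {"speaker": "SPEAKER_2", "text": "Hi!"},
    ],
}
```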
Webhook Response
When using `webhook_url`, the initial response will be different:

- `success` — Indicates whether the request was successfully queued.
- `status` — Will be "processing" when the transcription job is queued successfully.
- `id` — A unique identifier for the transcription job.
The complete transcription result will later be sent to your webhook URL with the same structure as the direct response.
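A client can distinguish a queued acknowledgement from a finished result by checking the status field. A minimal sketch, assuming the field names described above:

```python
def is_queued(resp: dict) -> bool:
    """True when the API accepted the job for asynchronous processing.

    The 'status' field and its 'processing' value follow the webhook
    response described above; treat the exact names as assumptions.
    """
    return bool(resp.get("success")) and resp.get("status") == "processing"

# queued acknowledgement vs. a direct (finished) response:
queued = {"success": True, "status": "processing", "id": "job_123"}
finished = {"success": True, "text": "Hello there."}
```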
Advanced Features
Speaker Diarization
Speaker diarization is the process of separating an audio stream into segments according to the identity of each speaker. When you enable the `by_speaker` parameter, the API will:
- Transcribe the audio as usual
- Identify distinct speakers in the recording
- Label each segment with a speaker identifier (e.g., “SPEAKER_1”, “SPEAKER_2”)
- Return both the standard chunks and a separate `speakers` array with speaker-separated transcriptions
This is particularly useful for:
- Meeting transcriptions
- Interview transcriptions
- Podcast transcriptions
- Any multi-speaker audio content
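For use cases like these, the speaker-segmented output can be merged into one transcript per participant. A sketch, assuming each segment carries `speaker` and `text` fields (an assumption based on the speakers array described above):

```python
from collections import defaultdict

def transcript_by_speaker(speakers: list[dict]) -> dict[str, str]:
    """Merge speaker-labelled segments into a single transcript per speaker."""
    merged = defaultdict(list)
    for seg in speakers:  # segments arrive in chronological order
        merged[seg["speaker"]].append(seg["text"])
    return {speaker: " ".join(parts) for speaker, parts in merged.items()}

segments = [
    {"speaker": "SPEAKER_1", "text": "Welcome to the meeting."},
    {"speaker": "SPEAKER_2", "text": "Thanks for having me."},
    {"speaker": "SPEAKER_1", "text": "Let's begin."},
]
```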
Webhook Usage
For long audio files, processing might take some time. Instead of keeping the connection open and waiting for the result, you can provide a `webhook_url` parameter. The API will:
- Return immediately with a job ID
- Process the audio asynchronously
- Send the complete transcription results to your webhook URL when finished
Make sure your webhook endpoint is set up to:
- Accept POST requests
- Parse JSON content
- Handle the same response format as the standard API response
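The three requirements above can be met with a few lines of standard-library Python. A minimal sketch (handler and helper names are ours; field names follow the response structure documented earlier):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_webhook_payload(body: bytes) -> str:
    """Parse a webhook delivery; return the transcript text, or '' on failure."""
    result = json.loads(body or b"{}")
    return result.get("text", "") if result.get("success") else ""

class TranscriptionWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Accept POST requests and parse the JSON body.
        length = int(self.headers.get("Content-Length", 0))
        transcript = handle_webhook_payload(self.rfile.read(length))
        # ... store `transcript` somewhere useful ...
        self.send_response(200)  # acknowledge receipt
        self.end_headers()

# To run locally (blocks the current thread):
# HTTPServer(("0.0.0.0", 8080), TranscriptionWebhook).serve_forever()
```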