Audio
Speech to Text
Transcribe video and audio files using the Whisper large-v3 model.
POST
Supported Formats and Limitations
- Supported formats: MP3, WAV, M4A, FLAC, AAC, OGG, WEBM
- Maximum file size: 100MB
- Maximum audio duration: 4 hours
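The limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the API: the function name `check_media` is hypothetical, the format is inferred from the file extension only (the API may inspect the content instead), and "100MB" is assumed to mean 100 × 1024 × 1024 bytes.

```python
from pathlib import Path
from typing import List, Optional

# Values copied from the list above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".aac", ".ogg", ".webm"}
MAX_FILE_BYTES = 100 * 1024 * 1024   # assumption: "100MB" read as 100 MiB
MAX_DURATION_SECONDS = 4 * 60 * 60   # 4 hours


def check_media(filename: str, size_bytes: int,
                duration_seconds: Optional[float] = None) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 100MB")
    # Duration is optional because it is often unknown before decoding.
    if duration_seconds is not None and duration_seconds > MAX_DURATION_SECONDS:
        problems.append("audio is longer than 4 hours")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a file at once.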
Request Parameters
Body
The video/audio url. Not required if file_store_key is specified.
The key used to store the video/audio file on JigsawStack File Storage. Not required if url is specified.
Either url or file_store_key should be provided, not both.
The language to transcribe or translate the file into. Use “auto” for automatic language detection, or specify a language code. If not specified, defaults to automatic detection. All supported language codes can be found here.
When set to true, translates the content into English (or the specified language if language parameter is provided). All supported language codes can be found here.
Identifies and separates different speakers in the audio file. When enabled, the response will include a speakers array with speaker-segmented transcripts.
Webhook URL to send result to. When provided, the API will process asynchronously and send results to this URL when completed.
The batch size to return. Maximum value is 40. This controls how the audio is chunked for processing.
The duration of each chunk in seconds. Maximum value is 15. This controls the duration of each chunk of audio that is processed.
When set to true, returns each word as its own entry in the chunks array with its own start and end timestamp. Useful for caption alignment and word-accurate search. Cannot be combined with stream=true.
Returns results as the audio is transcribed, instead of waiting for the full result. Good for live microphone input. Only supports language=en.
Skip chunks that contain no speech. Only applies when stream=true.
Sensitivity of speech detection, between 0 and 1. Lower values detect more speech; higher values are stricter. Only applies when stream=true and vad=true.
Header
Your JigsawStack API key
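Several of the parameter rules described in this document can be checked before a request is sent. The sketch below covers only the rules stated here (one of url or file_store_key, and the streaming restrictions). The function name `validate_params` is hypothetical, and skipping the url/file_store_key check for streaming requests is an assumption based on the statement that streaming audio is sent in the request body.

```python
from typing import Any, Dict, List


def validate_params(params: Dict[str, Any]) -> List[str]:
    """Return rule violations for a set of request parameters (empty list = OK)."""
    problems = []
    streaming = bool(params.get("stream"))

    if streaming:
        # Streaming currently supports English only.
        if params.get("language") != "en":
            problems.append("stream=true requires language=en")
        # These options are documented as unavailable with streaming.
        for name in ("word_timestamps", "by_speaker", "webhook_url"):
            if params.get(name):
                problems.append(f"{name} cannot be combined with stream=true")
    else:
        # Assumption: the source rule applies to non-streaming requests only.
        has_url = bool(params.get("url"))
        has_key = bool(params.get("file_store_key"))
        if has_url == has_key:
            problems.append("provide exactly one of url or file_store_key")
    return problems
```

For example, `validate_params({"url": "https://example.com/a.mp3", "by_speaker": True})` returns an empty list, while passing both url and file_store_key returns one problem.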
Response Structure
Indicates whether the call was successful.
Usage information for the API call.
A unique identifier for the request
The complete transcribed text from the audio/video file.
An array of transcript chunks with timestamps.
Only present when by_speaker is set to true. Contains speaker-segmented transcripts.
The language detected in the audio/video file. Available if language parameter is not provided or set to “auto”.
The confidence score for the language detected. Available if language parameter is not provided or set to “auto”.
Webhook Response
When using webhook_url, the initial response will be different.
Status of the transcription job.
- processing: The transcription job is queued successfully
- error: There was an issue with the transcription job
A unique identifier for the transcription job.
Advanced Features
Speaker Diarization
Speaker diarization is the process of separating an audio stream into segments according to the identity of each speaker. When you enable the by_speaker parameter, the API will:
- Transcribe the audio as usual
- Identify distinct speakers in the recording
- Label each segment with a speaker identifier (e.g., “SPEAKER_1”, “SPEAKER_2”)
- Return both the standard chunks and a separate speakers array with speaker-separated transcriptions
This feature is useful for:
- Meeting transcriptions
- Interview transcriptions
- Podcast transcriptions
- Any multi-speaker audio content
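As an illustration of how speaker-separated output might be consumed, the sketch below turns a speakers array into a readable dialogue, merging consecutive segments from the same speaker. The entry shape is an assumption: each entry is taken to be an object with a `speaker` label and a `text` field, which this document does not specify. The function name `to_dialogue` is hypothetical.

```python
from typing import Dict, List


def to_dialogue(speakers: List[Dict]) -> str:
    """Format speaker segments as 'LABEL: text' lines, one line per turn."""
    turns: List[List[str]] = []  # each turn is [speaker_label, text]
    for segment in speakers:
        # Assumed keys: "speaker" and "text".
        label = segment["speaker"]
        text = segment["text"].strip()
        if turns and turns[-1][0] == label:
            # Same speaker continues: extend the current turn.
            turns[-1][1] += " " + text
        else:
            turns.append([label, text])
    return "\n".join(f"{label}: {text}" for label, text in turns)
```

Merging adjacent segments keeps meeting or interview transcripts readable when one speaker's turn is split across several segments.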
Word-level timestamps
When you enable the word_timestamps parameter, each entry in the chunks array is a single word with its own start and end timestamp, instead of a multi-word segment. The API will:
- Transcribe the audio as usual
- Align each word against the audio waveform
- Return one entry per word in the chunks array, in order
This feature is useful for:
- Caption and subtitle alignment
- Word-accurate jump-to-timestamp search
- Video editors that cut on word boundaries
- Karaoke-style highlighting
word_timestamps cannot be combined with stream=true.
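For caption alignment, word-level chunks can be grouped into short subtitle cues. The following is a rough sketch in SubRip (SRT) style. It assumes each chunk is an object with a `text` field and a `timestamp` field holding a `[start, end]` pair in seconds; the field names appear in this document, but the pair layout and unit are assumptions. The helper names `srt_time` and `words_to_srt` are hypothetical.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(chunks: List[Dict], max_words: int = 7) -> str:
    """Group word-level chunks into cues of at most max_words words."""
    blocks = []
    for i in range(0, len(chunks), max_words):
        group = chunks[i:i + max_words]
        start = group[0]["timestamp"][0]   # start of the first word in the cue
        end = group[-1]["timestamp"][1]    # end of the last word in the cue
        text = " ".join(word["text"].strip() for word in group)
        blocks.append(
            f"{len(blocks) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

Splitting on a fixed word count is the simplest choice; a production captioner would more likely break cues on pauses or punctuation.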
Webhook Usage
For long audio files, processing might take some time. Instead of keeping the connection open and waiting for the result, you can provide a webhook_url parameter. The API will:
- Return immediately with a job ID
- Process the audio asynchronously
- Send the complete transcription results to your webhook URL when finished
Your webhook endpoint should be able to:
- Accept POST requests
- Parse JSON content
- Handle the same response format as the standard API response
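A webhook receiver that meets these requirements can be sketched with Python's standard library. This is an illustration only: the names `parse_webhook`, `WebhookHandler`, and `serve`, and the port, are hypothetical, and reading the transcript from a `text` key is an assumption about the payload shape.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook(raw: bytes) -> dict:
    """Decode the JSON body of a webhook delivery; raises ValueError on bad input."""
    payload = json.loads(raw.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = parse_webhook(self.rfile.read(length))
        except ValueError:
            # Malformed JSON or an unexpected top-level type.
            self.send_response(400)
            self.end_headers()
            return
        # Assumption: the full transcript is delivered under a "text" key.
        print("transcript received:", str(result.get("text", ""))[:80])
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. The URL passed as webhook_url must be reachable from the public internet, so a local test would need a tunnel or a deployed host.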
Streaming
Set stream=true to receive results as the audio is transcribed, instead of waiting for the full result. Good for live microphone input.
Requirements
- Pass language=en (streaming currently supports English only).
- Send audio in the request body (raw bytes with the matching Content-Type, or as a file field in multipart/form-data).
- by_speaker and webhook_url are not available with streaming.
Each event in the stream has a type:
| Type | When it’s sent | Useful fields |
|---|---|---|
| transcript.start | Once, when transcription begins | (none) |
| transcript.segment | For each recognized segment | chunk.timestamp, chunk.text |
| transcript.delta | Alongside each segment, as plain text | delta |
| transcript.done | Once, with the full transcript text | text |
| transcript.final | Once, with the full structured result | text, chunks |
| transcript.error | If transcription fails | message |
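The event types in the table can be folded into a running result on the client. The sketch below assumes each event has already been decoded into a dictionary with a `type` key and the listed fields at the top level. The wire format of the stream is not described here, so decoding is left out, and the function name `consume_events` is hypothetical.

```python
from typing import Dict, Iterable


def consume_events(events: Iterable[Dict]) -> Dict:
    """Accumulate streamed transcript events into a single result dict."""
    state = {"partial": "", "text": None, "chunks": None, "error": None}
    for event in events:
        kind = event.get("type")
        if kind == "transcript.delta":
            # Plain-text increment, suitable for live display.
            state["partial"] += event.get("delta", "")
        elif kind == "transcript.done":
            state["text"] = event.get("text")
        elif kind == "transcript.final":
            state["text"] = event.get("text")
            state["chunks"] = event.get("chunks")
        elif kind == "transcript.error":
            state["error"] = event.get("message")
            break  # stop processing after a failure
        # transcript.start and transcript.segment need no bookkeeping here
    return state
```

A live UI would typically render `partial` as it grows and replace it with `text` and `chunks` once transcript.final arrives.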