# Speech to Text

> Transcribe video and audio files with the Whisper large V3 model.

## Supported Formats and Limitations

* **Supported formats:** MP3, WAV, M4A, FLAC, AAC, OGG, WEBM
* **Maximum file size:** 100MB
* **Maximum audio duration:** 4 hours

## Request Parameters

### Body

* `url`: The video/audio URL. Not required if `file_store_key` is specified.
* `file_store_key`: The key used to store the video/audio file on JigsawStack File [Storage](/docs/api-reference/store/file/add). Not required if `url` is specified. Either `url` or `file_store_key` should be provided, not both.
* `language`: The language to transcribe or translate the file into. Use "auto" for automatic language detection, or specify a language code. If not specified, defaults to automatic detection. All supported language codes can be found [here](https://jigsawstack.com/docs/additional-resources/languages).
* When set to true, translates the content into English (or into the specified language if the `language` parameter is provided). All supported language codes can be found [here](https://jigsawstack.com/docs/additional-resources/languages).
* `by_speaker`: Identifies and separates different speakers in the audio file. When enabled, the response includes a `speakers` array with speaker-segmented transcripts.
* `webhook_url`: Webhook URL to send the result to. When provided, the API processes the request asynchronously and sends the result to this URL when completed.
* The batch size to return. Maximum value is 40. This controls how the audio is chunked for processing.
* The duration, in seconds, of each chunk of audio that is processed. Maximum value is 15.
* `word_timestamps`: When set to true, returns each word as its own entry in the `chunks` array with its own start and end timestamp. Useful for caption alignment and word-accurate search. Cannot be combined with `stream=true`.
* `stream`: Returns results as the audio is transcribed, instead of waiting for the full result. Good for live microphone input. Only supports `language=en`.
* `vad`: Skips chunks that contain no speech. Only applies when `stream=true`.
* Sensitivity of speech detection, between `0` and `1`. Lower values detect more speech; higher values are stricter. Only applies when `stream=true` and `vad=true`.

## Response Structure

* `success`: Indicates whether the call was successful.
* `_usage`: Usage information for the API call.
  * `input_tokens`: Number of input tokens processed.
  * `output_tokens`: Number of output tokens generated.
  * `inference_time_tokens`: Number of tokens processed during inference time.
  * `total_tokens`: Total number of tokens used. In the example response on this page, this is the sum of the input, output, and inference-time token counts (15 + 227 + 526 = 768).
* A unique identifier for the request.
* `text`: The complete transcribed text from the audio/video file.
* `chunks`: An array of transcript chunks with timestamps.
  * `timestamp`: Array containing the start and end time, in seconds, for the chunk.
  * `text`: The transcribed text for this time segment.
* `speakers`: Only present when `by_speaker` is set to true. Contains speaker-segmented transcripts. Each entry holds:
  * The speaker identifier (e.g., "Speaker 1").
  * An array containing the start and end time, in seconds, for this segment.
  * The transcribed text spoken by this speaker.
* The language detected in the audio/video file. Available if the `language` parameter is not provided or is set to "auto".
* The confidence score for the detected language. Available if the `language` parameter is not provided or is set to "auto".

### Webhook Response

When using `webhook_url`, the initial response is different.

Status of the transcription job. Possible values:

* `processing`: The transcription job was queued successfully.
* `error`: There was an issue with the transcription job.

A unique identifier for the transcription job.

The complete transcription result will later be sent to your webhook URL with the same structure as the direct response.

## Advanced Features

### Speaker Diarization

Speaker diarization is the process of separating an audio stream into segments according to the identity of each speaker. When you enable the `by_speaker` parameter, the API will:

1. Transcribe the audio as usual
2. Identify distinct speakers in the recording
3. Label each segment with a speaker identifier (e.g., "SPEAKER\_1", "SPEAKER\_2")
4. Return both the standard chunks and a separate `speakers` array with speaker-separated transcriptions

This is particularly useful for:

* Meeting transcriptions
* Interview transcriptions
* Podcast transcriptions
* Any multi-speaker audio content

### Word-level timestamps

When you enable the `word_timestamps` parameter, each entry in the `chunks` array is a single word with its own start and end timestamp, instead of a multi-word segment. The API will:

1. Transcribe the audio as usual
2. Align each word against the audio waveform
3. Return one entry per word in the `chunks` array, in order

This is particularly useful for:

* Caption and subtitle alignment
* Word-accurate jump-to-timestamp search
* Video editors that cut on word boundaries
* Karaoke-style highlighting

`word_timestamps` cannot be combined with `stream=true`.

### Webhook Usage

For long audio files, processing might take some time. Instead of keeping the connection open and waiting for the result, you can provide a `webhook_url` parameter. The API will:

1. Return immediately with a job ID
2. Process the audio asynchronously
3. Send the complete transcription results to your webhook URL when finished

Make sure your webhook endpoint is set up to:

* Accept POST requests
* Parse JSON content
* Handle the same response format as the standard API response

### Streaming

Set `stream=true` to receive results as the audio is transcribed, instead of waiting for the full result. Good for live microphone input.

**Requirements**

* Pass `language=en` (streaming currently supports English only).
* Send audio in the request body (raw bytes with the matching `Content-Type`, or as a `file` field in `multipart/form-data`).
* `by_speaker` and `webhook_url` are not available with streaming.

**Example request**

```bash theme={null}
curl "https://api.jigsawstack.com/v1/ai/transcribe?stream=true&language=en" \
  -X POST \
  -H "x-api-key: your-api-key" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav
```

**Events**

The response is a stream of events. Each event has a `type`:

| Type                 | When it's sent                        | Useful fields                   |
| -------------------- | ------------------------------------- | ------------------------------- |
| `transcript.start`   | Once, when transcription begins       | none                            |
| `transcript.segment` | For each recognized segment           | `chunk.timestamp`, `chunk.text` |
| `transcript.delta`   | Alongside each segment, as plain text | `delta`                         |
| `transcript.done`    | Once, with the full transcript text   | `text`                          |
| `transcript.final`   | Once, with the full structured result | `text`, `chunks`                |
| `transcript.error`   | If transcription fails                | `message`                       |

## Examples

**Example request**

```javascript Javascript theme={null}
import { JigsawStack } from "jigsawstack";

const jigsaw = JigsawStack({ apiKey: "your-api-key" });

const response = await jigsaw.audio.speech_to_text({
  "url": "https://jigsawstack.com/preview/stt-example.wav"
})
```

```python Python theme={null}
from jigsawstack import JigsawStack

jigsaw = JigsawStack(api_key="your-api-key")

response = jigsaw.audio.speech_to_text({
  "url": "https://jigsawstack.com/preview/stt-example.wav"
})
```

```bash Curl theme={null}
curl https://api.jigsawstack.com/v1/ai/transcribe \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'x-api-key: your-api-key' \
  -d '{"url":"https://jigsawstack.com/preview/stt-example.wav"}'
```

```ruby Ruby theme={null}
require 'net/http'
require 'json'

# Build the POST request with the JSON body and API key header
uri = URI('https://api.jigsawstack.com/v1/ai/transcribe')
req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req['x-api-key'] = 'your-api-key'

req.body = {
  'url' => 'https://jigsawstack.com/preview/stt-example.wav'
}.to_json

req_options = {
  use_ssl: uri.scheme == 'https'
}
res = Net::HTTP.start(uri.hostname, uri.port, req_options) do |http|
  http.request(req)
end
```

```go Go theme={null}
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	client := &http.Client{}
	var data = strings.NewReader(`{"url":"https://jigsawstack.com/preview/stt-example.wav"}`)
	req, err := http.NewRequest("POST", "https://api.jigsawstack.com/v1/ai/transcribe", data)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-api-key", "your-api-key")
	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	bodyText, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", bodyText)
}
```

```java Java theme={null}
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.net.http.HttpResponse;

HttpClient client = HttpClient.newHttpClient();

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.jigsawstack.com/v1/ai/transcribe"))
    .POST(BodyPublishers.ofString("{\"url\":\"https://jigsawstack.com/preview/stt-example.wav\"}"))
    .setHeader("Content-Type", "application/json")
    .setHeader("x-api-key", "your-api-key")
    .build();

HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
```

```swift Swift theme={null}
import Foundation

let jsonData = [
    "url": "https://jigsawstack.com/preview/stt-example.wav"
] as [String : Any]
let data = try! JSONSerialization.data(withJSONObject: jsonData, options: [])

let url = URL(string: "https://api.jigsawstack.com/v1/ai/transcribe")!
let headers = [
    "Content-Type": "application/json",
    "x-api-key": "your-api-key"
]

var request = URLRequest(url: url)
request.httpMethod = "POST"
request.allHTTPHeaderFields = headers
request.httpBody = data as Data

let task = URLSession.shared.dataTask(with: request) { (data, response, error) in
    if let error = error {
        print(error)
    } else if let data = data {
        let str = String(data: data, encoding: .utf8)
        print(str ?? "")
    }
}

task.resume()
```

```dart Dart theme={null}
import 'package:http/http.dart' as http;

void main() async {
  final headers = {
    'Content-Type': 'application/json',
    'x-api-key': 'your-api-key',
  };

  final data = '{"url":"https://jigsawstack.com/preview/stt-example.wav"}';

  final url = Uri.parse('https://api.jigsawstack.com/v1/ai/transcribe');

  final res = await http.post(url, headers: headers, body: data);
  final status = res.statusCode;
  if (status != 200) throw Exception('http.post error: statusCode= $status');

  print(res.body);
}
```

```kotlin Kotlin theme={null}
import java.io.IOException
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

val client = OkHttpClient()

val MEDIA_TYPE = "application/json".toMediaType()

val requestBody = "{\"url\":\"https://jigsawstack.com/preview/stt-example.wav\"}"

val request = Request.Builder()
  .url("https://api.jigsawstack.com/v1/ai/transcribe")
  .post(requestBody.toRequestBody(MEDIA_TYPE))
  .header("Content-Type", "application/json")
  .header("x-api-key", "your-api-key")
  .build()

client.newCall(request).execute().use { response ->
  if (!response.isSuccessful) throw IOException("Unexpected code $response")
  response.body!!.string()
}
```

```csharp C# theme={null}
using System.Net.Http.Headers;
using System.Net.Http.Json;

HttpClient client = new HttpClient();

HttpRequestMessage request = new HttpRequestMessage(HttpMethod.Post, "https://api.jigsawstack.com/v1/ai/transcribe");

request.Headers.Add("x-api-key", "your-api-key");

request.Content = JsonContent.Create(new { url = "https://jigsawstack.com/preview/stt-example.wav" });
request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/json");

HttpResponseMessage response = await client.SendAsync(request);
response.EnsureSuccessStatusCode();
string responseBody = await response.Content.ReadAsStringAsync();
Console.WriteLine(responseBody);
```

**Example response**

```json Response theme={null}
{
  "success": true,
  "text": " The little tales they tell are false The door was barred, locked and bolted as well Ripe pears are fit for a queen's table A big wet stain was on the round carpet The kite dipped and swayed but stayed aloft The pleasant hours fly by much too soon The room was crowded with a mild wob The room was crowded with a wild mob This strong arm shall shield your honour She blushed when he gave her a white orchid The beetle droned in the hot June sun",
  "chunks": [
    { "timestamp": [0, 2.39], "text": " The little tales" },
    { "timestamp": [2.39, 4.78], "text": "they tell are false" },
    { "timestamp": [4.78, 7.130000000000001], "text": " The door was barred," },
    { "timestamp": [7.130000000000001, 9.48], "text": "locked and bolted as well" },
    { "timestamp": [9.48, 11.27], "text": " Ripe pears are fit" },
    { "timestamp": [11.27, 13.06], "text": "for a queen's table" },
    { "timestamp": [13.06, 15.149999999999999], "text": " A big wet stain" },
    { "timestamp": [15.149999999999999, 17.24], "text": "was on the round carpet" },
    { "timestamp": [17.24, 19.509999999999998], "text": " The kite dipped and" },
    { "timestamp": [19.509999999999998, 21.78], "text": "swayed but stayed aloft" },
    { "timestamp": [21.78, 24.04], "text": " The pleasant hours fly" },
    { "timestamp": [24.04, 26.3], "text": "by much too soon" },
    { "timestamp": [26.3, 28.53], "text": " The room was crowded" },
    { "timestamp": [28.53, 30.76], "text": "with a mild wob" },
    { "timestamp": [30.76, 32.92], "text": " The room was crowded" },
    { "timestamp": [32.92, 35.08], "text": "with a wild mob" },
    { "timestamp": [35.08, 37.16], "text": " This strong arm" },
    { "timestamp": [37.16, 39.24], "text": "shall shield your honour" },
    { "timestamp": [39.24, 41.59], "text": " She blushed when he" },
    { "timestamp": [41.59, 43.94], "text": "gave her a white orchid" },
    { "timestamp": [43.94, 46.22], "text": " The beetle droned in" },
    { "timestamp": [46.22, 48.5], "text": "the hot June sun" }
  ],
  "_usage": {
    "input_tokens": 15,
    "output_tokens": 227,
    "inference_time_tokens": 526,
    "total_tokens": 768
  }
}
```