POST /v1/ai/transcribe
import { JigsawStack } from "jigsawstack";

const jigsawstack = JigsawStack({
  apiKey: "your-api-key",
});

// Basic transcription
const result = await jigsawstack.audio.speech_to_text({
  url: "https://example.com/path/to/audio.mp4",
});

// With speaker diarization
const resultWithSpeakers = await jigsawstack.audio.speech_to_text({
  url: "https://example.com/path/to/audio.mp4",
  by_speaker: true,
});

// With a webhook for asynchronous processing
const asyncResult = await jigsawstack.audio.speech_to_text({
  url: "https://example.com/path/to/audio.mp4",
  webhook_url: "https://your-server.com/webhooks/transcription",
});

// Translating to English
const translationResult = await jigsawstack.audio.speech_to_text({
  url: "https://example.com/path/to/audio.mp4",
  translate: true,
});

// Using a specific language
const specificLanguageResult = await jigsawstack.audio.speech_to_text({
  url: "https://example.com/path/to/audio.mp4",
  language: "es-ES", // Spanish
});

// Using a file from storage
const fileStoreResult = await jigsawstack.audio.speech_to_text({
  file_store_key: "uploads/recording.mp3",
});
{
  "success": true,
  "text": "Hey guys, I'm pretty excited to talk about a new API that we're going to be releasing in Jigsaw Stack...",
  "chunks": [
    {
      "timestamp": [0.96, 7.2],
      "text": "Hey guys, I'm pretty excited to talk about a new API that we're going to be releasing in Jigsaw"
    },
    {
      "timestamp": [7.2, 14.56],
      "text": "Stack. It's our AI Scrape API, which can basically scrape any website by just prompting and giving"
    }
    // Additional chunks...
  ]
}

Speech to Text processing is billed at 1 invocation per 25 seconds of processing time. This refers to computational time, not audio length: a 1-hour audio file might take only 20 seconds to process, costing just 1 invocation.
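As a rough guide, the number of billed invocations is the processing time divided by 25 seconds, rounded up. A minimal sketch of that arithmetic (illustrative only; actual billing is computed server-side):

// Rough estimate of billed invocations from measured processing time.
// Illustrative only - actual billing is determined server-side by JigsawStack.
const invocationsFor = (processingSeconds: number): number =>
  Math.max(1, Math.ceil(processingSeconds / 25));

invocationsFor(20); // 1 (e.g. a 1-hour file processed in 20 seconds)
invocationsFor(60); // 3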

Overview

Transform audio and video files into accurate text transcriptions using our powerful Speech to Text API. Built on Whisper large V3, this service provides:

  • High-accuracy transcription across multiple languages
  • Speaker identification for multi-person conversations
  • Translation capabilities for global content
  • Asynchronous processing via webhooks

Supported Formats and Limitations

  • Supported formats: MP3, WAV, M4A, FLAC, AAC, OGG, WEBM
  • Maximum file size: 500MB
  • Maximum audio duration: 4 hours

Request Parameters

Body

url
string

The URL of the video or audio file. Not required if file_store_key is specified.

file_store_key
string

The key used to store the video/audio file on JigsawStack File Storage. Not required if url is specified.

Either url or file_store_key should be provided, not both.
language
string

The language to transcribe or translate the file into. If not specified, the model will automatically detect the language and transcribe accordingly. All supported language codes can be found here.

translate
boolean
default:
"false"

When set to true, translates the content into English (or into the specified language if the language parameter is provided). All supported language codes can be found here.

by_speaker
boolean
default:
"false"

Identifies and separates different speakers in the audio file. When enabled, the response will include a speakers array with speaker-segmented transcripts.

webhook_url
string

Webhook URL to send result to. When provided, the API will process asynchronously and send results to this URL when completed.

batch_size
number
default:
"30"

The batch size used when processing the audio. Maximum value is 40. This controls how the audio is chunked for transcription.

x-api-key
string
required

Your JigsawStack API key
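The SDK examples above cover typical usage. If you prefer to call the REST endpoint directly, a minimal fetch sketch follows; the base URL https://api.jigsawstack.com is an assumption here, so confirm the host against the API reference:

// Direct HTTP call to the transcribe endpoint.
// Assumes base URL https://api.jigsawstack.com - verify against the API reference.
const response = await fetch("https://api.jigsawstack.com/v1/ai/transcribe", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": "your-api-key",
  },
  body: JSON.stringify({
    url: "https://example.com/path/to/audio.mp4",
    by_speaker: true,
  }),
});

const data = await response.json();
console.log(data.text);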

Response Structure

Direct Response

success
boolean

Indicates whether the call was successful.

text
string

The complete transcribed text from the audio/video file.

chunks
array

An array of transcript chunks with timestamps.

speakers
array

Only present when by_speaker is set to true. Contains speaker-segmented transcripts.
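For example, a short sketch that walks the chunks array from the response shown above and prints each segment with its start and end times:

// Print each transcript chunk with its start/end time in seconds.
const result = await jigsawstack.audio.speech_to_text({
  url: "https://example.com/path/to/audio.mp4",
});

for (const chunk of result.chunks) {
  const [start, end] = chunk.timestamp;
  console.log(`[${start}s - ${end}s] ${chunk.text}`);
}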

Webhook Response

When using webhook_url, the initial response will be different:

success
boolean

Indicates whether the request was successfully queued.

status
string

Will be “processing” when the transcription job is queued successfully.

id
string

A unique identifier for the transcription job.

The complete transcription result will later be sent to your webhook URL with the same structure as the direct response.
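Putting the fields above together, the initial response when webhook_url is supplied looks roughly like this (the id value is illustrative):

{
  "success": true,
  "status": "processing",
  "id": "<job-id>"
}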

Advanced Features

Speaker Diarization

Speaker diarization is the process of separating an audio stream into segments according to the identity of each speaker. When you enable the by_speaker parameter, the API will:

  1. Transcribe the audio as usual
  2. Identify distinct speakers in the recording
  3. Label each segment with a speaker identifier (e.g., “SPEAKER_1”, “SPEAKER_2”)
  4. Return both the standard chunks and a separate speakers array with speaker-separated transcriptions

This is particularly useful for:

  • Meeting transcriptions
  • Interview transcriptions
  • Podcast transcriptions
  • Any multi-speaker audio content
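A short sketch of consuming the speakers array is shown below. The field names inside each speaker segment (speaker, timestamp, text) are assumptions that mirror the chunk shape; confirm them against a real response:

// Print speaker-segmented transcript lines with timestamps.
// Segment field names (speaker, timestamp, text) are assumed, mirroring the chunk shape.
const result = await jigsawstack.audio.speech_to_text({
  url: "https://example.com/path/to/audio.mp4",
  by_speaker: true,
});

for (const segment of result.speakers ?? []) {
  const [start, end] = segment.timestamp;
  console.log(`${segment.speaker} [${start}s - ${end}s]: ${segment.text}`);
}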

Webhook Usage

For long audio files, processing might take some time. Instead of keeping the connection open and waiting for the result, you can provide a webhook_url parameter. The API will:

  1. Return immediately with a job ID
  2. Process the audio asynchronously
  3. Send the complete transcription results to your webhook URL when finished

Make sure your webhook endpoint is set up to:

  • Accept POST requests
  • Parse JSON content
  • Handle the same response format as the standard API response
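For instance, a minimal Express handler covering these requirements might look like the sketch below (Express and the /webhooks/transcription path are illustrative choices, not part of the API):

import express from "express";

const app = express();
app.use(express.json()); // parse JSON request bodies

// Receives the completed transcription from JigsawStack.
// The payload has the same shape as the direct response (success, text, chunks, ...).
app.post("/webhooks/transcription", (req, res) => {
  const { success, text, chunks } = req.body;
  if (success) {
    console.log(`Transcription received (${chunks.length} chunks)`);
    console.log(text);
  }
  res.sendStatus(200); // acknowledge receipt
});

app.listen(3000);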