24 Mar 2025
Following in the footsteps of our previous Mistral OCR benchmark, we now put Google's Gemini OCR to the test against JigsawStack vOCR. Both technologies claim robust text extraction, so we ran real-world tests across a variety of document types and languages to see which solution performs better in practice.
Before diving into the results, let's remember the key factors that differentiate exceptional OCR solutions from basic ones:
Multilingual text recognition capability
Ability to process both handwritten and printed text
Provision of precise bounding boxes for spatial positioning
Structured data extraction and formatting
Context understanding and intelligent interpretation
Consistency and accuracy across various document types
To compare the systems fairly, we provided Gemini OCR with a structured JSON prompt that closely mirrored JigsawStack vOCR’s native output format. This prompt instructed Gemini OCR to return:
Bounding boxes for words and lines
Structured text sections
Metadata, including document dimensions and detected tags
By enforcing this format, we ensured that both OCR solutions returned comparable outputs, making it easier to assess their real capabilities beyond differences in default response formatting.
Feature | Gemini OCR (Standard) | JigsawStack vOCR |
---|---|---|
Multilingual support | Moderate support, struggles with certain Asian languages | Strong support across 70+ languages including less common ones ✅ |
Handwriting recognition | Basic recognition with moderate accuracy | Superior accuracy with context preservation ✅ |
Bounding boxes | Not provided in standard output | Comprehensive word and line-level positioning data ✅ |
Structured output | Good JSON formatting but requires post-processing | Native structured output with customizable fields ✅ |
Processing speed | Fast processing (3-6 seconds) ✅ | Slightly slower (5-9 seconds) |
Context understanding | Good basic extraction but lacks semantic understanding | Better preservation of document context and relationships ✅ |
Feature | Gemini OCR (Structured Output) | JigsawStack vOCR |
---|---|---|
Multilingual support | Moderate support but significantly slower processing | Strong support across 70+ languages with consistent performance ✅ |
Handwriting recognition | Significantly degraded accuracy with bounding box requests | Superior accuracy maintained ✅ |
Bounding boxes | Incomplete or inaccurate positioning data with high latency | Comprehensive word and line-level positioning data with reasonable speed ✅ |
Structured output | Struggles with combined spatial and semantic data | Native structured output with semantic understanding ✅ |
Processing speed | Extremely slow (30-38 seconds) when spatial data requested | Consistent performance (5-9 seconds) ✅ |
Context understanding | Further degraded when forced into structured format | Better preserves contextual relationships ✅ |
We tested Gemini OCR and JigsawStack vOCR using four document types:
Receipt Processing – Extracting totals, taxes, and itemized entries.
Multilingual Recognition – Handling mixed-language street signs.
Handwritten Text Recognition – Transcribing cursive and stylized handwriting.
Structured Document Processing (PDFs) – Extracting tabular data and financial details.
Each OCR system received identical image inputs along with the corresponding prompt. We also structured the Gemini OCR request to ensure its output matched JigsawStack vOCR’s native response format using the following JSON schema:
{
"success": true,
"context": {},
"width": 1000,
"height": 750,
"tags": ["text", "document"],
"has_text": true,
"sections": [
{
"text": "Extracted text here",
"lines": [
{
"text": "Line text",
"bounds": {
"top_left": { "x": 100, "y": 50 },
"bottom_right": { "x": 300, "y": 70 }
},
"words": [
{
"text": "Word",
"bounds": {
"top_left": { "x": 110, "y": 55 },
"bottom_right": { "x": 140, "y": 65 }
}
}
]
}
]
}
]
}
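Because Gemini is only asked to follow this shape through a prompt, it helps to check each response before comparing it with vOCR output. The following is a minimal sketch of such a check, assuming the response has already been parsed into a Python dict; `find_schema_problems` and its helpers are hypothetical names, not part of either SDK, and the check covers only the fields shown in the template above.

```python
def _is_point(value):
    """True if value looks like {"x": <number>, "y": <number>}."""
    return (
        isinstance(value, dict)
        and isinstance(value.get("x"), (int, float))
        and isinstance(value.get("y"), (int, float))
    )


def _has_bounds(item):
    """True if item carries a bounds object with top_left and bottom_right points."""
    bounds = item.get("bounds") if isinstance(item, dict) else None
    return (
        isinstance(bounds, dict)
        and _is_point(bounds.get("top_left"))
        and _is_point(bounds.get("bottom_right"))
    )


def find_schema_problems(response):
    """Return a list of problems; an empty list means the shape matches the template."""
    problems = []
    expected_types = {
        "success": bool,
        "width": (int, float),
        "height": (int, float),
        "has_text": bool,
        "sections": list,
    }
    for key, expected in expected_types.items():
        if not isinstance(response.get(key), expected):
            problems.append(f"missing or mistyped top-level field: {key}")

    sections = response.get("sections")
    if not isinstance(sections, list):
        return problems

    for s_idx, section in enumerate(sections):
        for l_idx, line in enumerate(section.get("lines", [])):
            where = f"sections[{s_idx}].lines[{l_idx}]"
            if not _has_bounds(line):
                problems.append(f"{where}: missing or malformed bounds")
            for w_idx, word in enumerate(line.get("words", [])):
                if not _has_bounds(word):
                    problems.append(f"{where}.words[{w_idx}]: missing or malformed bounds")
    return problems


# The template from the article, trimmed to one line and one word
sample = {
    "success": True, "context": {}, "width": 1000, "height": 750,
    "tags": ["text", "document"], "has_text": True,
    "sections": [{
        "text": "Extracted text here",
        "lines": [{
            "text": "Line text",
            "bounds": {"top_left": {"x": 100, "y": 50}, "bottom_right": {"x": 300, "y": 70}},
            "words": [{
                "text": "Word",
                "bounds": {"top_left": {"x": 110, "y": 55}, "bottom_right": {"x": 140, "y": 65}},
            }],
        }],
    }],
}
print(find_schema_problems(sample))  # prints []
```

A response that returns text but omits coordinates would show up here as a list of "missing or malformed bounds" entries rather than passing silently.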
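import os
import json
import time
from concurrent.futures import ThreadPoolExecutor

# genai.Client, client.files.upload and Part.from_uri come from the google-genai SDK
from google import genai
from google.genai.types import Content, GenerateContentConfig, Part
from jigsawstack import JigsawStack, JigsawStackError

# API keys used by the functions below (assumed here to be set as environment variables)
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
JIGSAWSTACK_API_KEY = os.environ["JIGSAWSTACK_API_KEY"]

# Define test images and prompts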
TEST_IMAGES = [
"test_files/sample_receipt.jpg",
"test_files/sample_handwriting.jpg",
"test_files/sample_multilingual.jpg",
"test_files/sample_pdf.pdf",
]
PROMPTS = {
"test_files/sample_receipt.jpg": "Extract the total price, tax, and all itemized entries from this receipt.",
"test_files/sample_handwriting.jpg": "Transcribe all handwritten text from this image, ensuring accuracy in cursive and print styles.",
"test_files/sample_multilingual.jpg": "Extract all text from this image, identifying different languages and preserving formatting.",
"test_files/sample_pdf.pdf": "Extract all structured text and maintain the document's section hierarchy from this PDF."
}
OUTPUT_FOLDER = "benchmark_results"
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

def generate(
    file: str = "sample_receipt.jpg",
    model: str = "gemini-2.0-flash",
    prompt: tuple = ("Extract the total price, tax, and all itemized entries from this receipt.",),
):
    """Uploads the file to Gemini and requests a JSON response for the given prompts"""
    client = genai.Client(api_key=GEMINI_API_KEY)
    files = [
        client.files.upload(file=file),
    ]
    contents = [
        Content(
            role="user",
            parts=[
                Part.from_uri(
                    file_uri=files[0].uri,
                    mime_type=files[0].mime_type,
                ),
                Part.from_text(text="Based on the given file and its contents perform vOCR to obtain results for each of the following:\n" + "\n".join(prompt)),
            ],
        ),
    ]
    generate_content_config = GenerateContentConfig(
        temperature=0.6,
        top_p=0.95,
        top_k=40,
        max_output_tokens=8192,
        response_mime_type="application/json",
    )
    content = client.models.generate_content(
        model=model,
        contents=contents,
        config=generate_content_config
    )
    return content

def call_gemini_ocr(image_path):
    """Runs Gemini OCR on one file and returns (response_text, latency_seconds)"""
    if not os.path.exists(image_path):
        print(f"Error: File {image_path} not found.")
        return None, None
    try:
        prompt = PROMPTS.get(image_path, "Extract all text from this image.")
        # Timing covers the file upload as well as generation, because generate() does both
        start_time = time.perf_counter()
        response = generate(file=image_path, prompt=[prompt])
        latency = time.perf_counter() - start_time
        if not response or not hasattr(response, "text"):
            print(f"Warning: Gemini API returned unexpected response for {image_path}")
            return None, latency
        return response.text, latency
    except Exception as e:
        print(f"Gemini API Error: {e}")
        return None, None

def upload_to_jigsawstack(image_path):
    """Uploads file to JigsawStack File Storage and returns file_store_key"""
    if not os.path.exists(image_path):
        print(f"Error: File {image_path} not found.")
        return None
    jigsawstack = JigsawStack(api_key=JIGSAWSTACK_API_KEY)
    try:
        with open(image_path, "rb") as image_file:
            image_data = image_file.read()
        result = jigsawstack.store.upload(
            image_data, {"filename": os.path.basename(image_path), "overwrite": True}
        )
        result = result.json()
        # .get() avoids a KeyError so the missing-key check below can actually run
        file_key = result.get("key")
        if not file_key:
            print(f"Error: JigsawStack did not return a valid key for {image_path}")
            return None
        return file_key
    except JigsawStackError as err:
        print(f"Error uploading {image_path} to JigsawStack: {err}")
        return None
    except Exception as e:
        print(f"Unexpected error during JigsawStack upload: {e}")
        return None

def call_jigsawstack_vocr(image_path):
    """Calls JigsawStack vOCR using file_store_key and optimized prompts"""
    jigsawstack = JigsawStack(api_key=JIGSAWSTACK_API_KEY)
    file_store_key = upload_to_jigsawstack(image_path)
    if not file_store_key:
        print(f"Skipping JigsawStack vOCR for {image_path} due to upload failure")
        return None, None
    prompt = PROMPTS.get(image_path, "Describe the image in detail.")
    # Timing starts after the upload, so it covers only the vOCR call itself
    start_time = time.perf_counter()
    try:
        result = jigsawstack.vision.vocr({"file_store_key": file_store_key, "prompt": prompt})
        result = result.json()
        latency = time.perf_counter() - start_time
        return result, latency
    except JigsawStackError as err:
        print(f"Error processing {image_path} with vOCR: {err}")
        return str(err), None
    except Exception as e:
        print(f"Unexpected error during JigsawStack vOCR: {e}")
        return str(e), None

def process_image(image):
    """Runs OCR on a single image using both APIs"""
    print(f"Processing {image}...")
    gemini_result, gemini_latency = call_gemini_ocr(image)
    jigsawstack_result, jigsawstack_latency = call_jigsawstack_vocr(image)
    save_results(image, gemini_result, gemini_latency, jigsawstack_result, jigsawstack_latency)

def run_benchmark():
    """Runs benchmark tests in parallel"""
    try:
        with ThreadPoolExecutor() as executor:
            # map() is lazy about errors: consuming the results re-raises any worker exception
            list(executor.map(process_image, TEST_IMAGES))
        print("Benchmarking complete. Results saved in", OUTPUT_FOLDER)
    except Exception as e:
        print(f"Error during benchmark execution: {e}")


if __name__ == "__main__":
    try:
        run_benchmark()
    except Exception as e:
        print(f"Fatal error in benchmark script: {e}")
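The script above calls `save_results`, but its definition is not included in the listing. A minimal sketch of what such a helper could look like follows; the file naming, the record layout, and the `output_folder` parameter are assumptions made for illustration, not the original implementation.

```python
import json
import os


def save_results(image, gemini_result, gemini_latency,
                 jigsawstack_result, jigsawstack_latency,
                 output_folder="benchmark_results"):
    """Write both engines' outputs and latencies for one input to a JSON file."""
    os.makedirs(output_folder, exist_ok=True)
    # e.g. test_files/sample_receipt.jpg -> benchmark_results/sample_receipt_results.json
    stem = os.path.splitext(os.path.basename(image))[0]
    output_path = os.path.join(output_folder, f"{stem}_results.json")
    record = {
        "input": image,
        "gemini": {"latency_seconds": gemini_latency, "result": gemini_result},
        "jigsawstack": {"latency_seconds": jigsawstack_latency, "result": jigsawstack_result},
    }
    with open(output_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Japanese text readable; default=str guards
        # against values that are not JSON-serializable
        json.dump(record, f, ensure_ascii=False, indent=2, default=str)
    return output_path
```

Writing one file per input keeps the parallel workers from contending over a shared output file.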
We evaluated both systems on a standard Walmart receipt containing multiple line items, taxes, and totals.
Gemini with Unstructured Output:
{
"total": "144.02",
"tax": "4.58",
"items": [
{
"name": "TATER TOTS",
"price": "2.96"
},
{
"name": "HARD/PROV/DC",
"price": "2.68"
},
{
"name": "SNACK BARS",
"price": "4.98"
},
// Additional items truncated for brevity
]
}
Processed in 6.43 seconds with no spatial data
Gemini With Structured Output and Detecting Bounding Boxes:
{
"total": "144.02",
"tax": "4.58",
"items": [
{
"name": "TATER TOTS",
"price": "2.96"
},
{
"name": "HARD/PROV/DC",
"price": "2.68"
},
{
"name": "SNACK BARS",
"price": "4.98"
},
// Additional items truncated for brevity
]
}
Processed in 6.15 seconds with no spatial data
JigsawStack vOCR:
{
"success": true,
"context": "Here are the details extracted from the receipt...",
"sections": [
{
"text": "See back of receipt for your chance\n
to win $1000 ID #: 7N5N1VIXCQDQ\n
Walmart\n 317-851-1102 Mgr:JAMIE BROOKSHIRE\n
882 S. STATE ROAD 135\nGREENWOOD IN 46143\n
ST# 05483 OP# 001436 TE# 09 TR# 06976\n
TATER TOTS\n001312000026 F 2.96 0\n...",
"lines": [
{
"text": "See back of receipt for your chance",
"bounds": {
"top_left": { "x": 185, "y": 63 },
"top_right": { "x": 459, "y": 76 },
"bottom_right": { "x": 459, "y": 93 },
"bottom_left": { "x": 184, "y": 84 },
"width": 274.5, "height": 19
},
"words": [
{
"text": "See",
"bounds": {
// Bounding box data for each word
}
},
// Additional words truncated
]
},
// Additional lines truncated
]
}
],
// Additional data truncated
}
Processed in 9.08 seconds with comprehensive positioning data
Gemini OCR (unstructured output):
Accuracy: Good extraction of receipt data (total, tax, items)
Processing Time: 6.43 seconds
Output Quality: Clean JSON with essential receipt information
Gemini OCR (structured output with bounding boxes):
Accuracy: Similar extraction quality, but still no positioning data
Processing Time: 6.15 seconds
Output Quality: Failed to provide spatial data despite explicit request
JigsawStack vOCR:
Accuracy: Comprehensive text extraction from the receipt
Processing Time: 9.08 seconds
Output Quality: Complete text capture with detailed positioning information
Organization: Includes precise position data for each text element
Analysis: While Gemini OCR processed the receipt faster, JigsawStack vOCR delivered substantially more detail, including the exact position of each text element, which is crucial for applications that need a spatial understanding of the document.
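To make the positioning data concrete, here is a small sketch that works with the four-corner `bounds` object from the receipt output above. Averaging the opposite edges happens to reproduce the `width` (274.5) and `height` (19) reported for the first line, but that is only one reading consistent with the sample values, not a confirmed description of how vOCR computes them. The function names are hypothetical.

```python
def box_dimensions(bounds):
    """Average the horizontal extent of the top and bottom edges, and the
    vertical extent of the left and right edges, of a four-corner box."""
    tl, tr = bounds["top_left"], bounds["top_right"]
    br, bl = bounds["bottom_right"], bounds["bottom_left"]
    width = ((tr["x"] - tl["x"]) + (br["x"] - bl["x"])) / 2
    height = ((bl["y"] - tl["y"]) + (br["y"] - tr["y"])) / 2
    return width, height


def box_center(bounds):
    """Centroid of the four corners, useful for sorting lines into reading order."""
    corners = [bounds[k] for k in ("top_left", "top_right", "bottom_right", "bottom_left")]
    cx = sum(c["x"] for c in corners) / 4
    cy = sum(c["y"] for c in corners) / 4
    return cx, cy


# Bounds of "See back of receipt for your chance" from the vOCR receipt output
first_line = {
    "top_left": {"x": 185, "y": 63},
    "top_right": {"x": 459, "y": 76},
    "bottom_right": {"x": 459, "y": 93},
    "bottom_left": {"x": 184, "y": 84},
}
print(box_dimensions(first_line))  # (274.5, 19.0)
print(box_center(first_line))      # (321.75, 79.0)
```

The slightly different y values on the left and right corners show that the line is tilted in the photo, which is why four corners carry more information than a simple top-left/bottom-right pair.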
We evaluated a multilingual street sign containing Japanese characters and directional information.
Gemini with Unstructured Output:
[
"a",
"a",
"a",
"a",
"0.2 Km",
"alamy",
"四天王寺",
"alamy",
"a",
"庚申堂>",
"a",
"0.1km",
"a",
"竹本義太夫墓",
"●(超願寺内)。すぐ。",
"a",
"alamy",
"Image ID: CBDNR6",
"www.alamy.com",
"a",
"a",
"a"
]
Processed in 3.02 seconds - limited structure and context
Gemini With Structured Output and Detecting Bounding Boxes:
{
"success": true,
"context": {},
"width": 900,
"height": 1200,
"tags": [],
"has_text": true,
"sections": [
{
"text": "a\n
a\n
0.2 Km\n
alamy\n
四天王寺\n
alamy\n
a\n庚申堂>\n
a\n0.1km\n
a\n竹本義太夫墓\n
●(超願寺内)。すぐ。\n
a\nalamy\n
Image ID: CBDNR6\nwww.alamy.com",
"lines": [
{
"text": "a",
"bounds": {
"top_left": { "x": 19, "y": 19 },
"top_right": { "x": 34, "y": 19 },
"bottom_right": { "x": 34, "y": 29 },
"bottom_left": { "x": 19, "y": 29 },
"width": 15,
"height": 10
},
"words": [
{
"text": "a",
"bounds": {
"top_left": { "x": 19, "y": 19 },
"top_right": { "x": 34, "y": 19 },
"bottom_right": { "x": 34, "y": 29 },
"bottom_left": { "x": 19, "y": 29 },
"width": 15,
"height": 10
}
}
]
},
//Lines omitted for brevity
]
}
]
}
Processed in 30.67 seconds with spatial data
JigsawStack vOCR:
{
"success": true,
"context": "I'm unable to extract text from the image directly. However, I can help with general information or answer questions you might have!",
"width": 1300,
"height": 951,
"tags": [
"text", "screenshot", "rectangle", "font", "line",
"number", "signage", "colorfulness"
],
"has_text": true,
"sections": [
{
"text": "a\n四天王寺\n
a a a\n0.2Km\n
alamy\nalamy\n
a a\na\n
0. 1km\n
庚申心\n
02 a\n
alamy alamy\n
竹本義太夫墓\n
a\n
(超願寺内)。すぐ。\n
a\n
alamy\n
Image ID: CBDNR6\nwww.alamy.com",
"lines": [
{
"text": "a",
"bounds": {
"top_left": {
"x": 1089,
"y": 24
},
"top_right": {
"x": 1106,
"y": 24
},
"bottom_right": {
"x": 1106,
"y": 49
},
"bottom_left": {
"x": 1089,
"y": 49
},
"width": 17,
"height": 25
}
}
// Detailed line data with position information
]
}
]
}
Processed in 6.98 seconds
Gemini OCR (unstructured output):
Accuracy: Successfully captured Japanese characters
Processing Time: 3.02 seconds
Output Quality: Limited structure and contextual information
Gemini OCR (structured output with bounding boxes):
Accuracy: Successfully captured Japanese characters
Processing Time: 30.67 seconds
Output Quality: Attempted to provide spatial data but with minimal context interpretation
JigsawStack vOCR:
Accuracy: Successfully recognized Japanese characters
Processing Time: 6.98 seconds (4.4× faster than Gemini's structured output)
Output Quality: Provided both structured data and a context summary
Organization: Included useful image tags (text, screenshot, rectangle, font, etc.)
Analysis: JigsawStack vOCR offered significantly better performance with comparable accuracy, processing the multilingual content more than four times faster than Gemini OCR's structured output while providing more meaningful context.
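Both engines also picked up watermark fragments ("a", "alamy") alongside the sign text. One way to separate the Japanese lines from the Latin-script noise after extraction is to test each line against Unicode code point ranges. The sketch below is a post-processing idea and not part of either API; the ranges cover Hiragana, Katakana, and CJK Unified Ideographs only, and ideographs are shared with Chinese, so this detects script rather than language.

```python
def contains_cjk(text):
    """True if text has at least one Hiragana, Katakana, or CJK ideograph character."""
    for ch in text:
        cp = ord(ch)
        # U+3040-U+30FF: Hiragana and Katakana; U+4E00-U+9FFF: CJK Unified Ideographs
        if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:
            return True
    return False


def split_by_script(lines):
    """Partition extracted lines into (cjk_lines, other_lines), keeping order."""
    cjk_lines = [line for line in lines if contains_cjk(line)]
    other_lines = [line for line in lines if not contains_cjk(line)]
    return cjk_lines, other_lines


# A subset of the lines returned for the street sign
extracted = ["a", "0.2 Km", "alamy", "四天王寺", "庚申堂>", "0.1km",
             "竹本義太夫墓", "●(超願寺内)。すぐ。", "Image ID: CBDNR6"]
cjk, other = split_by_script(extracted)
print(cjk)    # ['四天王寺', '庚申堂>', '竹本義太夫墓', '●(超願寺内)。すぐ。']
print(other)  # ['a', '0.2 Km', 'alamy', '0.1km', 'Image ID: CBDNR6']
```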
We evaluated both systems on a handwritten poem with cursive and stylized text.
Gemini with Unstructured Output:
[
"The lovely Seng night may soup hing shineg",
"Wensome and faranell my heart was beating",
"th rosehush on fre moor the violet beautiful",
"The artists, evening song new love new hiff",
"To behinja Holde bili marst se lang Inerell farewell",
"Non I leave this litle hunt where my beloved live",
"Walking now with wiled steps through the lenses",
"Luna shines throught busk and oak zephar per path",
"And the bich trees bowing how shed incense on the trade",
"How beautiful the coolness of this lovely summer night!",
"Hon the asl fills with happines in this tul place of quiet!",
"I can scarcely gross the bliss, jot Heaven I would shan",
"A thousand nights like this if my darling granted one."
]
Processed in 3.79 seconds - numerous transcription errors
Gemini With Structured Output and Detecting Bounding Boxes:
{
//bounding boxes
"\"text\": \"The loure\\nWensome and faranell my heart\",\n
\"lines\": [\n {\n \"text\": \"The loure\",\n
\"bounds\": {\n \"top_left\": {\n \"x\": 58,\n \"y\": 48\n},\n
\"top_right\": {\n \"x\": 147,\n\"y\": 48\n },\n
\"bottom_right\": {\n \"x\": 147,\n\"y\": 73\n },\n
\"bottom_left\": {\n \"x\": 58,\n \"y\": 73\n},\n
\"width\": 89,\n \"height\": 25\n}"
"text\": \"th rosehush on fre the violet beautiful\\n
My\\nThe artists, evening song\\nnew life\\n"
//text output
"To behinja Holde bili marst se lang Inerell farewell\\n
Non I leave this litle hunt where my beloved live\\n
Walking now with wiled steps through the lenses\\n
Luna shines throught busk and oak zephar per path\\n
And the bich trees bowing how shed incense on the trade\\n
How beautiful the coolness of this lovely summer night!\\n
Hon the asl fills with happines in this tul place of quiet!\\n
I can scarcely scarcely gross the bliss, jot Heaven I would shan\\n
A thousand nights like this if my darling granted one."
}
Processed in 37.55 seconds - numerous transcription errors
JigsawStack vOCR:
{
"success": true,
"context": "The lovely Spring night may come when she shines\n
Welcome and farewell my heart was beating\n
the rosebank on the river the violet beautiful\n
The nights evening song we love new life\n
to be alive this must we leave farewell\n
Now I have this little hut where I heard him\n
Walking now with naked steps through the doors\n
when shines moonlight husk and oak zephyr perfake\n
And the nice trees towering overhead incense on the road\n
How beautiful the coolness of this lovely summer night!\n
Even the old fills with happiness in this true place of quiet!\n
I can scarcely grasp the bliss, yet Heaven, I would share\n
A thousand nights like this if my darling granted one.",
"width": 459,
"height": 360,
"tags": [
"text", "handwriting", "letter", "calligraphy",
"paper", "document", "font"
],
"has_text": true,
"sections": [
{
"text": "The lorey Seng night may comp ling stuing...",
"lines": [
// Detailed line-by-line data with bounding boxes
]
}
]
}
Processed in 7.16 seconds - better contextual understanding and structure preservation
Gemini OCR (unstructured output):
Accuracy: Captured handwritten text with numerous transcription errors
Processing Time: 3.79 seconds
Output Quality: Basic text extraction without spatial context
Gemini OCR (structured output with bounding boxes):
Accuracy: Transcription errors similar to the unstructured output
Processing Time: 37.55 seconds
Output Quality: Attempted to provide bounding boxes but with incomplete content
JigsawStack vOCR:
Accuracy: Better contextual understanding of handwritten content
Processing Time: 7.16 seconds (5× faster than Gemini's structured output)
Output Quality: Provided both raw text and a human-readable interpretation
Organization: Better preserved the meaning of the handwritten content
Analysis: JigsawStack vOCR demonstrated significantly better performance with improved accuracy, processing handwritten content about five times faster than Gemini OCR's structured output while delivering more contextually meaningful results.
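The handwriting comparison here is qualitative. To put a number on transcription quality, a common approach is the character error rate (CER): the edit distance between a trusted reference transcription and the OCR output, divided by the reference length. The sketch below implements it with a plain Levenshtein distance. No hand-checked reference for the poem is given in this article, so the example compares the two engines' versions of the same line, which measures how much they disagree, not which one is correct.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (0.0 means identical)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Same poem line as returned by each engine (stand-in comparison, not ground truth)
jigsawstack_line = "I can scarcely grasp the bliss, yet Heaven, I would share"
gemini_line = "I can scarcely gross the bliss, jot Heaven I would shan"
disagreement = character_error_rate(jigsawstack_line, gemini_line)
print(f"Disagreement between engines: {disagreement:.2%}")
```

With a manually verified transcription as `reference`, the same function would give a per-engine accuracy figure that could sit next to the latency numbers.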
We evaluated an invoice PDF with tabular data, company information, and financial details.
Gemini with Unstructured Output:
{
"invoice_number": "3299",
"invoice_date": "May 6, 2024",
"due_date": "May 17, 2024",
"po_number": "15",
"billing_address": "Futurelink Solutions\nKlausdalsbrovej 601\nBallerup 2750\nDenmark",
"company_address": "Sampleroad 14\nPostal 1410\nDenmark",
"company_name": "Demo Business Partner",
"balance_due": "€2,841.44",
"line_items": [
{
"item": "47500177- Ø0.6mm Drill Guide",
"quantity": "50",
"rate": "€2.50",
"amount": "€125.00"
},
// Additional line items truncated
],
"subtotal": "€2,257.15",
"tax": "€564.29",
"shipping": "€20.00",
"total": "€2,841.44",
"notes": "Please pay in due time",
"terms": "Terms of payment: Netto 10 days\nPlease transfer amount to account: Reg.nr. 1234 Konto nr. 0123456789\nWhen paying by bank transfer, please state invoice no."
}
Processed in 5.34 seconds - clean structured data
Gemini With Structured Output and Detecting Bounding Boxes:
{
"success": true,
"context": {},
"width": 792,
"height": 1122,
"tags": [],
"has_text": true,
"sections": [
{
"text": "Demo Business Partner\nSampleroad 14\nPostal 1410\nDenmark",
"lines": [
{
"text": "Demo Business Partner",
"bounds": {
"top_left": {"x": 130, "y": 24},
"top_right": {"x": 268, "y": 24},
"bottom_right": {"x": 268, "y": 37},
"bottom_left": {"x": 130, "y": 37},
"width": 137.859,
"height": 13.11
},
"words": [
{
"text": "Demo",
"bounds": {
"top_left": {"x": 130, "y": 24},
"top_right": {"x": 166, "y": 24},
"bottom_right": {"x": 166, "y": 37},
"bottom_left": {"x": 130, "y": 37},
"width": 35.922,
"height": 13.11
}
}
// Additional words omitted for brevity
]
}
// Additional lines omitted for brevity
]
}
]
}
Processed in 38.17 seconds
JigsawStack vOCR:
{
"success": true,
"context": "```json\n{\n \"INVOICE\": {\n \"#\": \"3299\",\n
\"Date\": \"May 6, 2024\",\n \"Due Date\": \"May 17, 2024\",\n
\"PO Number\": \"15\",\n \"Balance Due\": \"€2,841.44\"\n },\n
\"Bill To\": {\n \"Company\": \"Futurelink Solutions\",\n
\"Address\": \"Klausdalsbrovej 601\\nBallerup 2750\\nDenmark\"\n },\n
\"Items\": [\n {\n \"Item\": \"47500177 - Ø0.6mm Drill Guide\",\n
\"Quantity\": \"50\",\n \"Rate\": \"€2.50\",\n
\"Amount\": \"€125.00\"\n },\n // Additional items truncated\n ],\n
\"Totals\": {\n \"Subtotal\": \"€2,257.15\",\n
\"Tax (25%)\": \"€564.29\",\n \"Shipping\": \"€20.00\",\n
\"Total\": \"€2,841.44\"\n },\n \"Notes\": [\n
\"Please pay in due time\"\n ],\n \"Terms\": [\n
\"Terms of payment: Netto 10 days\",\n
\"Please transfer amount to account: Reg.nr. 1234 Konto nr. 0123456789\",\n
\"When paying by bank transfer, please state invoice no.\"\n ]\n}\n```",
"total_pages": 1,
"width": 612,
"height": 792,
"tags": [
"text", "screenshot", "document", "font"
],
"has_text": true,
"sections": [
{
"text": "Demo Business Partner\nINVOICE\nSampleroad 14\n...",
"lines": [
// Detailed line data with positioning information
]
}
]
}
Processed in 6.39 seconds - structured with additional document metadata
Gemini OCR (unstructured output):
Accuracy: Good extraction of the invoice text
Processing Time: 5.34 seconds
Output Quality: Clean structured data focused on business-relevant fields
Gemini OCR (structured output with bounding boxes):
Accuracy: Poor extraction when prompted for structured data with bounding boxes
Processing Time: 38.17 seconds
Output Quality: Only produced a few lines of data with coordinates
JigsawStack vOCR:
Accuracy: Excellent extraction with business context
Processing Time: 6.39 seconds (6× faster than Gemini's structured output)
Output Quality: Pre-structured JSON representation with items, totals, and metadata already parsed
Organization: Data organized hierarchically with clear section demarcation
Analysis: JigsawStack vOCR excelled with dramatically better speed and more useful structured output when compared to Gemini OCR's attempt at producing structured data with bounding boxes.
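In the invoice result, the vOCR `context` field arrived as a string containing JSON wrapped in a Markdown code fence, while the receipt result carried plain prose in the same field. A small helper can normalize both cases before downstream use. This is a sketch under the assumption that the field is either a dict, a fenced JSON string, a bare JSON string, or free text; `parse_context` is a hypothetical name, not an SDK function.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the listing readable
FENCED_JSON = re.compile(
    r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)


def parse_context(context):
    """Return the context as a dict when it holds JSON, otherwise None."""
    if isinstance(context, dict):
        return context
    text = context.strip()
    match = FENCED_JSON.match(text)
    if match:
        text = match.group(1)  # keep only what sits between the fences
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None  # free-text context, such as the receipt summary
    return parsed if isinstance(parsed, dict) else None


fenced = FENCE + 'json\n{\n  "INVOICE": {"#": "3299", "Balance Due": "€2,841.44"}\n}\n' + FENCE
print(parse_context(fenced)["INVOICE"]["#"])  # 3299
print(parse_context("Here are the details extracted from the receipt..."))  # None
```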
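Speed Advantage: In its default (unstructured) mode, Gemini OCR consistently processed documents faster (3.02-6.43 seconds versus 6.39-9.08 seconds for JigsawStack vOCR) but provided less detailed output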
Positioning Information: JigsawStack vOCR's inclusion of comprehensive bounding box data represents a significant advantage for applications requiring spatial understanding
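Performance Balance: JigsawStack vOCR offered the better overall balance: once bounding boxes were requested, it was roughly four to six times faster than Gemini's structured output on three of the four documents while returning more useful output structures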
Specialized Use Cases: Gemini OCR performed well for basic receipt processing but struggled with complex documents requiring spatial information
Handwriting Recognition: JigsawStack demonstrated superior capabilities with greater accuracy and context preservation
Structured Output: JigsawStack offered more flexibility in customizing extraction fields and maintaining document relationships
Multilingual Support: JigsawStack appeared to have broader language support based on documentation and testing
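The speed figures quoted in the findings can be recomputed directly from the per-test latencies reported above. The short script below does that arithmetic; the dictionary layout and names are only for illustration.

```python
# Latencies in seconds, as reported in each test section
LATENCIES = {
    "receipt":           {"gemini_unstructured": 6.43, "gemini_structured": 6.15,  "jigsawstack": 9.08},
    "multilingual sign": {"gemini_unstructured": 3.02, "gemini_structured": 30.67, "jigsawstack": 6.98},
    "handwriting":       {"gemini_unstructured": 3.79, "gemini_structured": 37.55, "jigsawstack": 7.16},
    "invoice PDF":       {"gemini_unstructured": 5.34, "gemini_structured": 38.17, "jigsawstack": 6.39},
}


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline (>1 means faster)."""
    return baseline_seconds / candidate_seconds


for test_name, t in LATENCIES.items():
    vs_structured = speedup(t["gemini_structured"], t["jigsawstack"])
    vs_unstructured = speedup(t["gemini_unstructured"], t["jigsawstack"])
    print(f"{test_name:18s} vOCR vs Gemini structured: {vs_structured:.1f}x, "
          f"vs Gemini unstructured: {vs_unstructured:.1f}x")
```

Values below 1.0 mean vOCR was slower in that comparison, which is the case against Gemini's unstructured mode in every test and against its structured mode on the receipt, where Gemini returned no bounding boxes.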
Our benchmarking reveals that while Gemini OCR offers impressive speed for basic extraction, JigsawStack vOCR provides a more comprehensive solution with superior positional data, handwriting recognition, and structural understanding. For applications requiring detailed document analysis rather than basic text extraction, JigsawStack vOCR demonstrates clear advantages.
The choice between these solutions ultimately depends on specific use case requirements:
If processing speed for simple text extraction is paramount, Gemini OCR may be preferable
If spatial understanding, handwriting recognition, or detailed document structure analysis is needed, JigsawStack vOCR offers superior capabilities
import { JigsawStack } from "jigsawstack";
const jigsawstack = JigsawStack({
apiKey: "your-api-key",
});
const result = await jigsawstack.vision.vocr({
prompt: ["total_price", "tax"],
url: "https://example.com/receipt.jpg",
});
console.log(result);
// Output example:
// {
// success: true,
// total_price: "144.02",
// tax: "4.58",
// // Additional context and metadata...
// }
from jigsawstack import JigsawStack
jigsawstack = JigsawStack(api_key="your-api-key")
result = jigsawstack.vision.vocr({
"url": "https://example.com/receipt.jpg",
"prompt": ["total_price", "tax"]
})
print(result)
# Similar structured output as JavaScript example
Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!