Beta

Gemini OCR vs JigsawStack vOCR

20 May 2025

Angel Pichardo

Share this article

Following our previous benchmarking of Mistral OCR, we now rigorously evaluate Google's Gemini OCR against JigsawStack vOCR. Both technologies boast strong text extraction capabilities, but which offers practical, reliable performance across diverse, real-world scenarios? We've conducted extensive testing across multilingual texts, structured PDFs, handwritten documents, and standard receipts to uncover their strengths and limitations.

What Makes a Great OCR Solution?

Before diving into the results, let's remember the key factors that differentiate exceptional OCR solutions from basic ones:

Multilingual text recognition capability
Ability to process both handwritten and printed text
Provision of precise bounding boxes for spatial positioning
Structured data extraction and formatting
Context understanding and intelligent interpretation
Consistency and accuracy across various document types

Ensuring a Fair Comparison

For fair comparison, Gemini OCR was given structured JSON prompts matching JigsawStack vOCR’s native output. Both OCR systems returned:

Bounding boxes (words, lines)
Structured text sections
Metadata (dimensions, tags)

This ensured comparability and clarity beyond default response differences.

Summary Comparison:

Gemini OCR vs. JigsawStack vOCR

◐ = partial ❌ = inaccurate/fails ✅ = accurate/succeeds

Feature	Gemini OCR	JigsawStack vOCR
🌐 Multilingual Support	Good base coverage with standard processing times ◐	Excellent support for 70+ languages with efficient processing ✅
📝 Handwriting Recognition	Captures basic handwritten content ❌	Strong accuracy with contextual interpretation capabilities ✅
⊞ Bounding Boxes	Provides coordinate data for identified text ◐	Detailed positioning with comprehensive width/height measurements ✅
📁 Structured Output	Limited, provided basic text extraction with some formatting ❌	Rich hierarchical structure with semantic and spatial integration ✅
⚡ Processing Speed	Variable processing times (11-42 seconds) ❌	Consistently faster processing (12-32 seconds) ✅
🧠 Context Understanding	Identifies document types and basic structure ◐	Preserves relationships between elements with dual-layer analysis ✅
📕 Complex Document Handling	Handles standard files; may face token limits or parsing issues with complex content ❌	Excels with intricate layouts and maintains structure integrity ✅

Benchmarking Methodology

We tested Gemini OCR and JigsawStack vOCR using four document types:

Receipt Processing
Multilingual Signage
Handwritten Text
Structured PDFs.

Interactive Testing

You can run these tests yourself using this Google Colab Notebook.

Each OCR system received identical image inputs along with the corresponding prompt. We also structured the Gemini OCR request to ensure its output matched JigsawStack vOCR’s native response format using the following JSON schema:

Structured output (JSON)

...

Benchmarking Setup

Before we begin, let’s setup the environment that showcases the comparison for which we shall download the test files:

Python

...

Setting up Gemini for vOCR

Setting up Gemini to give structured output is fairly straight forward as part of this benchmark setup we shall be using the following response schema:

Python

...

Results

Test 1: Receipt Processing

We evaluated both systems on a standard Walmart receipt containing multiple line items, taxes, and totals.
View the full response here.

Response - Gemini OCR:

...

Processed in 41 seconds

Response - JigsawStack vOCR:

...

Processed in 16 seconds

Gemini OCR Performance

Accuracy: Good extraction of full receipt text with complete transaction details
Processing Time: 41 seconds
Output Quality: Basic text extraction with spatial coordinates and detailed structure
Organization: Simple text format with minimal structure; no word-level breakdown or contextual grouping

JigsawStack vOCR Performance

Accuracy: Complete extraction with precise word-level detail and comprehensive contextual data
Processing Time: 16 seconds
Output Quality: Rich structured data with multiple representations (raw text, itemized entries, financial summaries)
Organization: Sophisticated hierarchical structure with sections, lines, words, and precise spatial coordinates for each element

Test 2: Multilingual Text Recognition

Part 1. We evaluated a multilingual street sign containing Japanese characters and directional information.
View the full response here.

Response - Gemini OCR:

...

Processed in 38 seconds

Response - JigsawStack vOCR:

...

Processed in 16 seconds

Gemini OCR

Processing: 38 seconds
Character Recognition: Successfully identifies major Japanese locations (四天王寺, 庚申堂, 竹本義太夫墓)
Metadata: Provides helpful tags like "sign", "Japanese", "direction", "distance"

JigsawStack vOCR

Processing: 12 seconds
Visual Context: Includes helpful details like "Yellow background with blue text" and identifies directional arrows
Direction Indicators: Preserves the relationship between text and arrows (e.g., "← 四天王寺" and "竹本義太夫墓 →")
Symbol Formatting: Maintains proper Japanese formatting with correct parentheses styles
Structure: Organizes content in a logical, consistent pattern

Part 2. We evaluated a multilingual learning example containing English & Telugu

Response - Gemini OCR :

...

Processed in 9 seconds

Response - JigsawStack vOCR:

...

Processed in 14 seconds

Gemini OCR

Processing Time: 9 seconds
Layout Detection: Successfully captures document structure with accurate bounding box coordinates
English Recognition: Correctly recognizes all English words and transliterations in parentheses (athadu, aame, etc.)
Document Context: Identifies the content as language learning material with translations

JigsawStack vOCR

Processing Time: 14 seconds
Telugu Script Handling: Provides proper Unicode encoding for Telugu characters (అతడు, ఆమె, అబ్బాయి, etc.) in the context section
Document Structure: Features dual-layer recognition with separate context and raw recognition layers
Format Analysis: Includes detailed information about text color, alignment, and list formatting
Layout Precision: Provides comprehensive bounding box coordinates with width/height measurements
System Metrics: Delivers helpful token usage statistics for optimization

Test 3: Handwritten Text Recognition

We evaluated both systems on a handwritten poem with cursive and stylized text.

Response - Gemini OCR :

...

Processed in 40 seconds

Response - JigsawStack vOCR

...

Processed in 40 seconds

Gemini OCR Performance

Processing Time: 42 seconds
Accuracy: Handles the handwritten words successfully, with occasional variations in challenging text portions
Output Quality: Provides basic bounding box coordinates
Contextual Limitations: Difficulties differentiate between phrases and meaningful content, e.g., "stuing," "mited steps," "floy Alis"

JigsawStack vOCR Performance

Processing Time: 32 seconds
Accuracy: Demonstrates good contextual understanding with intelligent interpretation of handwritten content
Output Quality: Delivers complete JSON with both raw text recognition and enhanced interpretation
Linguistic Intelligence: Reconstructs likely intended phrases like "How the soul fills with happiness" instead of "Hon the sl fills with happines"

Test 4: Structured Document (PDF) Processing

We evaluated a 15 page PDF: https://arxiv.org/pdf/2406.04692

Response - Gemini OCR:

...

Processed in 43 seconds

Response - JigsawStack vOCR:

...

Processed in 37 seconds

Gemini OCR Performance

Processing Time: 43 seconds
Accuracy: Incomplete extraction due to running out of tokens, processing only a fraction of the document
Output Quality: Limited to extracting metadata and first page elements before encountering errors
Coordinate Precision: High precision for elements it processed but failed to maintain throughout
Reliability: Encountered processing limitations leading to incomplete output

JigsawStack vOCR Performance

Processing Time: 37 seconds
Accuracy: Comprehensive extraction of all 15 pages with complete contextual information
Output Quality: Well-structured JSON with hierarchical organization of document elements
Coordinate Precision: Detailed bounding box coordinates with width/height measurements for every text element
Reliability: Successfully processed over 350,000 tokens of content with no degradation

Key Findings

Processing Efficiency: JigsawStack vOCR consistently delivers faster processing times across various document types while maintaining high-quality results
Structured Data Organization: JigsawStack provides comprehensive output structures with hierarchical formatting that makes information immediately actionable
Multilingual Capabilities: JigsawStack shows particular strength in handling non-Latin scripts like Japanese and Telugu with proper Unicode encoding
Contextual Understanding: JigsawStack offers dual-layer recognition that provides both raw text and enhanced interpretations for challenging content
Document Intelligence: JigsawStack includes valuable metadata about document formatting, language detection, and visual presentation

Conclusion

Our benchmarking shows that both systems offer effective OCR capabilities with different strengths. Gemini OCR provides good basic text recognition with solid performance for straightforward content. JigsawStack vOCR delivers enhanced functionality through its structured output formats, superior multilingual support, and comprehensive document analysis.

Run these tests yourself here: Google Colab Notebook.

Gemini OCR vs. JigsawStack vOCR

◐ = partial ❌ = inaccurate/fails ✅ = accurate/succeeds

Feature	Gemini OCR	JigsawStack vOCR
🌐 Multilingual Support	Good base coverage with standard processing times ◐	Excellent support for 70+ languages with efficient processing ✅
📝 Handwriting Recognition	Captures basic handwritten content ❌	Strong accuracy with contextual interpretation capabilities ✅
⊞ Bounding Boxes	Provides coordinate data for identified text ◐	Detailed positioning with comprehensive width/height measurements ✅
📁 Structured Output	Limited, provided basic text extraction with some formatting ❌	Rich hierarchical structure with semantic and spatial integration ✅
⚡ Processing Speed	Variable processing times (11-42 seconds) ❌	Consistently faster processing (12-32 seconds) ✅
🧠 Context Understanding	Identifies document types and basic structure ◐	Preserves relationships between elements with dual-layer analysis ✅
📕 Complex Document Handling	Handles standard files; may face token limits or parsing issues with complex content ❌	Excels with intricate layouts and maintains structure integrity ✅

Recommended Use:

Basic Text Recognition: Both systems perform effectively for simple English text extraction
Detailed Document Analysis: JigsawStack vOCR offers more comprehensive structured data with spatial positioning and formatting details
Multilingual Processing: JigsawStack demonstrates notable advantages for non-Latin scripts and complex language handling
Time-Sensitive Applications: JigsawStack's consistently faster processing times provide efficiency benefits for high-volume document processing

👥 Join the JigsawStack Community

Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!

Share this article