Gemini OCR vs JigsawStack vOCR

Share this article

Gemini OCR vs JigsawStack vOCR

Following our previous benchmarking of Mistral OCR, we now rigorously evaluate Google's Gemini OCR against JigsawStack vOCR. Both technologies boast strong text extraction capabilities, but which offers practical, reliable performance across diverse, real-world scenarios? We've conducted extensive testing across multilingual texts, structured PDFs, handwritten documents, and standard receipts to uncover their strengths and limitations.

What Makes a Great OCR Solution?

Before diving into the results, let's remember the key factors that differentiate exceptional OCR solutions from basic ones:

  • Multilingual text recognition capability

  • Ability to process both handwritten and printed text

  • Provision of precise bounding boxes for spatial positioning

  • Structured data extraction and formatting

  • Context understanding and intelligent interpretation

  • Consistency and accuracy across various document types

Ensuring a Fair Comparison

For fair comparison, Gemini OCR was given structured JSON prompts matching JigsawStack vOCR’s native output. Both OCR systems returned:

  • Bounding boxes (words, lines)

  • Structured text sections

  • Metadata (dimensions, tags)

This ensured comparability and clarity beyond default response differences.

Summary Comparison:

Gemini OCR vs. JigsawStack vOCR

◐ = partial ❌ = inaccurate/fails ✅ = accurate/succeeds

FeatureGemini OCRJigsawStack vOCR
🌐 Multilingual SupportGood base coverage with standard processing times ◐Excellent support for 70+ languages with efficient processing ✅
📝 Handwriting RecognitionCaptures basic handwritten content ❌Strong accuracy with contextual interpretation capabilities ✅
Bounding BoxesProvides coordinate data for identified text ◐Detailed positioning with comprehensive width/height measurements ✅
📁 Structured OutputLimited, provided basic text extraction with some formatting ❌Rich hierarchical structure with semantic and spatial integration ✅
Processing SpeedVariable processing times (11-42 seconds) ❌Consistently faster processing (12-32 seconds) ✅
🧠 Context UnderstandingIdentifies document types and basic structure ◐Preserves relationships between elements with dual-layer analysis ✅
📕 Complex Document HandlingHandles standard files; may face token limits or parsing issues with complex content ❌Excels with intricate layouts and maintains structure integrity ✅

Benchmarking Methodology

We tested Gemini OCR and JigsawStack vOCR using four document types:

  1. Receipt Processing

  2. Multilingual Signage

  3. Handwritten Text

  4. Structured PDFs.

Interactive Testing

You can run these tests yourself using this Google Colab Notebook.

Each OCR system received identical image inputs along with the corresponding prompt. We also structured the Gemini OCR request to ensure its output matched JigsawStack vOCR’s native response format using the following JSON schema:

Structured output (JSON)

Benchmarking Setup

Before we begin, let’s setup the environment that showcases the comparison for which we shall download the test files:

Python

Setting up Gemini for vOCR

Setting up Gemini to give structured output is fairly straight forward as part of this benchmark setup we shall be using the following response schema:

Python

Results

Test 1: Receipt Processing

We evaluated both systems on a standard Walmart receipt containing multiple line items, taxes, and totals.
View the full response here.

Response - Gemini OCR:

Processed in 41 seconds

Response - JigsawStack vOCR:

Processed in 16 seconds

Gemini OCR Performance

  • Accuracy: Good extraction of full receipt text with complete transaction details

  • Processing Time: 41 seconds

  • Output Quality: Basic text extraction with spatial coordinates and detailed structure

  • Organization: Simple text format with minimal structure; no word-level breakdown or contextual grouping

JigsawStack vOCR Performance

  • Accuracy: Complete extraction with precise word-level detail and comprehensive contextual data

  • Processing Time: 16 seconds

  • Output Quality: Rich structured data with multiple representations (raw text, itemized entries, financial summaries)

  • Organization: Sophisticated hierarchical structure with sections, lines, words, and precise spatial coordinates for each element

Test 2: Multilingual Text Recognition

Part 1. We evaluated a multilingual street sign containing Japanese characters and directional information.
View the full response here.

Response - Gemini OCR:

Processed in 38 seconds

Response - JigsawStack vOCR:

Processed in 16 seconds

Gemini OCR

  • Processing: 38 seconds

  • Character Recognition: Successfully identifies major Japanese locations (四天王寺, 庚申堂, 竹本義太夫墓)

  • Metadata: Provides helpful tags like "sign", "Japanese", "direction", "distance"

JigsawStack vOCR

  • Processing: 12 seconds

  • Visual Context: Includes helpful details like "Yellow background with blue text" and identifies directional arrows

  • Direction Indicators: Preserves the relationship between text and arrows (e.g., "← 四天王寺" and "竹本義太夫墓 →")

  • Symbol Formatting: Maintains proper Japanese formatting with correct parentheses styles

  • Structure: Organizes content in a logical, consistent pattern

Part 2. We evaluated a multilingual learning example containing English & Telugu

Response - Gemini OCR :

Processed in 9 seconds

Response - JigsawStack vOCR:

Processed in 14 seconds

Gemini OCR

  • Processing Time: 9 seconds

  • Layout Detection: Successfully captures document structure with accurate bounding box coordinates

  • English Recognition: Correctly recognizes all English words and transliterations in parentheses (athadu, aame, etc.)

  • Document Context: Identifies the content as language learning material with translations

JigsawStack vOCR

  • Processing Time: 14 seconds

  • Telugu Script Handling: Provides proper Unicode encoding for Telugu characters (అతడు, ఆమె, అబ్బాయి, etc.) in the context section

  • Document Structure: Features dual-layer recognition with separate context and raw recognition layers

  • Format Analysis: Includes detailed information about text color, alignment, and list formatting

  • Layout Precision: Provides comprehensive bounding box coordinates with width/height measurements

  • System Metrics: Delivers helpful token usage statistics for optimization

Test 3: Handwritten Text Recognition

We evaluated both systems on a handwritten poem with cursive and stylized text.

Response - Gemini OCR :

Processed in 40 seconds

Response - JigsawStack vOCR

Processed in 40 seconds

Gemini OCR Performance

  • Processing Time: 42 seconds

  • Accuracy: Handles the handwritten words successfully, with occasional variations in challenging text portions

  • Output Quality: Provides basic bounding box coordinates

  • Contextual Limitations: Difficulties differentiate between phrases and meaningful content, e.g., "stuing," "mited steps," "floy Alis"

JigsawStack vOCR Performance

  • Processing Time: 32 seconds

  • Accuracy: Demonstrates good contextual understanding with intelligent interpretation of handwritten content

  • Output Quality: Delivers complete JSON with both raw text recognition and enhanced interpretation

  • Linguistic Intelligence: Reconstructs likely intended phrases like "How the soul fills with happiness" instead of "Hon the sl fills with happines"

Test 4: Structured Document (PDF) Processing

We evaluated a 15 page PDF: https://arxiv.org/pdf/2406.04692

Response - Gemini OCR:

Processed in 43 seconds

Response - JigsawStack vOCR:

Processed in 37 seconds

Gemini OCR Performance

  • Processing Time: 43 seconds

  • Accuracy: Incomplete extraction due to running out of tokens, processing only a fraction of the document

  • Output Quality: Limited to extracting metadata and first page elements before encountering errors

  • Coordinate Precision: High precision for elements it processed but failed to maintain throughout

  • Reliability: Encountered processing limitations leading to incomplete output

JigsawStack vOCR Performance

  • Processing Time: 37 seconds

  • Accuracy: Comprehensive extraction of all 15 pages with complete contextual information

  • Output Quality: Well-structured JSON with hierarchical organization of document elements

  • Coordinate Precision: Detailed bounding box coordinates with width/height measurements for every text element

  • Reliability: Successfully processed over 350,000 tokens of content with no degradation

Key Findings

  • Processing Efficiency: JigsawStack vOCR consistently delivers faster processing times across various document types while maintaining high-quality results

  • Structured Data Organization: JigsawStack provides comprehensive output structures with hierarchical formatting that makes information immediately actionable

  • Multilingual Capabilities: JigsawStack shows particular strength in handling non-Latin scripts like Japanese and Telugu with proper Unicode encoding

  • Contextual Understanding: JigsawStack offers dual-layer recognition that provides both raw text and enhanced interpretations for challenging content

  • Document Intelligence: JigsawStack includes valuable metadata about document formatting, language detection, and visual presentation

Conclusion

Our benchmarking shows that both systems offer effective OCR capabilities with different strengths. Gemini OCR provides good basic text recognition with solid performance for straightforward content. JigsawStack vOCR delivers enhanced functionality through its structured output formats, superior multilingual support, and comprehensive document analysis.

Run these tests yourself here: Google Colab Notebook.

Gemini OCR vs. JigsawStack vOCR

◐ = partial ❌ = inaccurate/fails ✅ = accurate/succeeds

FeatureGemini OCRJigsawStack vOCR
🌐 Multilingual SupportGood base coverage with standard processing times ◐Excellent support for 70+ languages with efficient processing ✅
📝 Handwriting RecognitionCaptures basic handwritten content ❌Strong accuracy with contextual interpretation capabilities ✅
Bounding BoxesProvides coordinate data for identified text ◐Detailed positioning with comprehensive width/height measurements ✅
📁 Structured OutputLimited, provided basic text extraction with some formatting ❌Rich hierarchical structure with semantic and spatial integration ✅
Processing SpeedVariable processing times (11-42 seconds) ❌Consistently faster processing (12-32 seconds) ✅
🧠 Context UnderstandingIdentifies document types and basic structure ◐Preserves relationships between elements with dual-layer analysis ✅
📕 Complex Document HandlingHandles standard files; may face token limits or parsing issues with complex content ❌Excels with intricate layouts and maintains structure integrity ✅

Recommended Use:

  • Basic Text Recognition: Both systems perform effectively for simple English text extraction

  • Detailed Document Analysis: JigsawStack vOCR offers more comprehensive structured data with spatial positioning and formatting details

  • Multilingual Processing: JigsawStack demonstrates notable advantages for non-Latin scripts and complex language handling

  • Time-Sensitive Applications: JigsawStack's consistently faster processing times provide efficiency benefits for high-volume document processing

👥 Join the JigsawStack Community

Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!

Share this article