20 May 2025
Following our previous benchmarking of Mistral OCR, we now rigorously evaluate Google's Gemini OCR against JigsawStack vOCR. Both technologies boast strong text extraction capabilities, but which offers practical, reliable performance across diverse, real-world scenarios? We've conducted extensive testing across multilingual texts, structured PDFs, handwritten documents, and standard receipts to uncover their strengths and limitations.
Before diving into the results, let's remember the key factors that differentiate exceptional OCR solutions from basic ones:
Multilingual text recognition capability
Ability to process both handwritten and printed text
Provision of precise bounding boxes for spatial positioning
Structured data extraction and formatting
Context understanding and intelligent interpretation
Consistency and accuracy across various document types
For fair comparison, Gemini OCR was given structured JSON prompts matching JigsawStack vOCR’s native output. Both OCR systems returned:
Bounding boxes (words, lines)
Structured text sections
Metadata (dimensions, tags)
This ensured comparability and clarity beyond default response differences.
◐ = partial ❌ = inaccurate/fails ✅ = accurate/succeeds
Feature | Gemini OCR | JigsawStack vOCR |
---|---|---|
🌐 Multilingual Support | Good base coverage with standard processing times ◐ | Excellent support for 70+ languages with efficient processing ✅ |
📝 Handwriting Recognition | Captures basic handwritten content ❌ | Strong accuracy with contextual interpretation capabilities ✅ |
⊞ Bounding Boxes | Provides coordinate data for identified text ◐ | Detailed positioning with comprehensive width/height measurements ✅ |
📁 Structured Output | Limited; provides basic text extraction with some formatting ❌ | Rich hierarchical structure with semantic and spatial integration ✅ |
⚡ Processing Speed | Variable processing times (11-42 seconds) ❌ | Faster in most tests (12-32 seconds) ✅ |
🧠 Context Understanding | Identifies document types and basic structure ◐ | Preserves relationships between elements with dual-layer analysis ✅ |
📕 Complex Document Handling | Handles standard files; may face token limits or parsing issues with complex content ❌ | Excels with intricate layouts and maintains structure integrity ✅ |
We tested Gemini OCR and JigsawStack vOCR using four document types:
Receipt Processing
Multilingual Signage
Handwritten Text
Structured PDFs
You can run these tests yourself using this Google Colab Notebook.
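Processing times like the ones reported for each test can be captured by timing the call from request to response. A minimal sketch, assuming each OCR request is wrapped in a Python callable; `timed_call` and the stand-in `fake_ocr` are hypothetical names and not part of either SDK:

```python
import time
from typing import Any, Callable


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_ocr(image_path: str) -> dict:
    # Stand-in for a real OCR request; replace with the actual client call.
    time.sleep(0.01)
    return {"source": image_path, "sections": []}


result, seconds = timed_call(fake_ocr, "receipt.jpg")
print(f"Processed in {seconds:.2f} seconds")
```

Wrapping both systems in the same helper keeps the measurement identical on each side, so network and serialization overhead are counted the same way for both.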
Each OCR system received identical image inputs along with the corresponding prompt. We also structured the Gemini OCR request to ensure its output matched JigsawStack vOCR’s native response format using the following JSON schema:
Structured output (JSON)
Before we begin, let’s set up the environment and download the test files used in the comparison:
Python
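A minimal sketch of this setup step, using only the standard library. The helper name `download_files`, the `test_files` directory, and the local file name are assumptions for illustration. Only the PDF URL appears in this article, so the image URLs would need to be taken from the notebook:

```python
import urllib.request
from pathlib import Path


def download_files(urls: dict[str, str], dest_dir: str = "test_files") -> dict[str, Path]:
    """Download each URL to dest_dir/<name> and return the local paths.

    Files that already exist are skipped so the cell can be re-run safely.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths: dict[str, Path] = {}
    for name, url in urls.items():
        target = dest / name
        if not target.exists():
            with urllib.request.urlopen(url) as response:
                target.write_bytes(response.read())
        paths[name] = target
    return paths


# Only this URL is given in the article; add the image URLs from the notebook.
TEST_FILES = {
    "structured.pdf": "https://arxiv.org/pdf/2406.04692",
}

# Usage (performs network requests):
#   local_paths = download_files(TEST_FILES)
```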
Setting up Gemini to return structured output is fairly straightforward. For this benchmark, we use the following response schema:
Python
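A sketch of what such a response schema can look like, covering sections, lines, words, bounding boxes, and metadata. The field names (`bounds`, `sections`, `tags`, and so on) are assumptions chosen to mirror the output described earlier, not the exact keys used in the benchmark. `response_mime_type` and `response_schema` are the Gemini API's structured-output settings; the SDK call itself is left as a comment so the block runs without credentials:

```python
from typing import Optional

# Bounding box as top-left corner plus size; field names are assumptions.
BOUNDS = {
    "type": "OBJECT",
    "properties": {
        "x": {"type": "INTEGER"},
        "y": {"type": "INTEGER"},
        "width": {"type": "INTEGER"},
        "height": {"type": "INTEGER"},
    },
    "required": ["x", "y", "width", "height"],
}


def text_with_bounds(extra: Optional[dict] = None) -> dict:
    """Schema for an element with text, a bounding box, and optional children."""
    props = {"text": {"type": "STRING"}, "bounds": BOUNDS}
    props.update(extra or {})
    return {"type": "OBJECT", "properties": props, "required": ["text", "bounds"]}


WORD = text_with_bounds()
LINE = text_with_bounds({"words": {"type": "ARRAY", "items": WORD}})
SECTION = {
    "type": "OBJECT",
    "properties": {
        "text": {"type": "STRING"},
        "lines": {"type": "ARRAY", "items": LINE},
    },
    "required": ["text", "lines"],
}
RESPONSE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "width": {"type": "INTEGER"},
        "height": {"type": "INTEGER"},
        "tags": {"type": "ARRAY", "items": {"type": "STRING"}},
        "sections": {"type": "ARRAY", "items": SECTION},
    },
    "required": ["width", "height", "sections"],
}

# Generation settings that request JSON constrained to the schema above.
GENERATION_CONFIG = {
    "response_mime_type": "application/json",
    "response_schema": RESPONSE_SCHEMA,
}

# With the google-genai SDK installed, the request would look roughly like this
# (model name and input variables are placeholders):
#   from google import genai
#   client = genai.Client(api_key="YOUR_API_KEY")
#   response = client.models.generate_content(
#       model="<gemini-model-name>",
#       contents=[image_part, prompt],
#       config=GENERATION_CONFIG,
#   )
```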
We evaluated both systems on a standard Walmart receipt containing multiple line items, taxes, and totals.
View the full response here.
Response - Gemini OCR:
Processed in 41 seconds
Response - JigsawStack vOCR:
Processed in 16 seconds
Analysis - Gemini OCR:
Accuracy: Good extraction of full receipt text with complete transaction details
Processing Time: 41 seconds
Output Quality: Basic text extraction with spatial coordinates but limited structure
Organization: Simple text format with minimal structure; no word-level breakdown or contextual grouping
Analysis - JigsawStack vOCR:
Accuracy: Complete extraction with precise word-level detail and comprehensive contextual data
Processing Time: 16 seconds
Output Quality: Rich structured data with multiple representations (raw text, itemized entries, financial summaries)
Organization: Sophisticated hierarchical structure with sections, lines, words, and precise spatial coordinates for each element
Part 1. We evaluated a multilingual street sign containing Japanese characters and directional information.
View the full response here.
Response - Gemini OCR:
Processed in 38 seconds
Response - JigsawStack vOCR:
Processed in 16 seconds
Analysis - Gemini OCR:
Processing Time: 38 seconds
Character Recognition: Successfully identifies major Japanese locations (四天王寺, 庚申堂, 竹本義太夫墓)
Metadata: Provides helpful tags like "sign", "Japanese", "direction", "distance"
Analysis - JigsawStack vOCR:
Processing Time: 12 seconds
Visual Context: Includes helpful details like "Yellow background with blue text" and identifies directional arrows
Direction Indicators: Preserves the relationship between text and arrows (e.g., "← 四天王寺" and "竹本義太夫墓 →")
Symbol Formatting: Maintains proper Japanese formatting with correct parentheses styles
Structure: Organizes content in a logical, consistent pattern
Part 2. We evaluated a multilingual language-learning example containing English and Telugu.
Response - Gemini OCR:
Processed in 9 seconds
Response - JigsawStack vOCR:
Processed in 14 seconds
Analysis - Gemini OCR:
Processing Time: 9 seconds
Layout Detection: Successfully captures document structure with accurate bounding box coordinates
English Recognition: Correctly recognizes all English words and transliterations in parentheses (athadu, aame, etc.)
Document Context: Identifies the content as language learning material with translations
Analysis - JigsawStack vOCR:
Processing Time: 14 seconds
Telugu Script Handling: Provides proper Unicode encoding for Telugu characters (అతడు, ఆమె, అబ్బాయి, etc.) in the context section
Document Structure: Features dual-layer recognition with separate context and raw recognition layers
Format Analysis: Includes detailed information about text color, alignment, and list formatting
Layout Precision: Provides comprehensive bounding box coordinates with width/height measurements
System Metrics: Delivers helpful token usage statistics for optimization
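Bounding boxes from two systems are easier to compare once they share one convention. The sketch below assumes one system reports corner coordinates and the other a top-left corner with width/height; both conventions, and the names `Box`, `from_corners`, and `iou`, are illustrative assumptions. Intersection-over-union (IoU) then gives a simple agreement score between two boxes for the same word or line:

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def from_corners(x_min: float, y_min: float, x_max: float, y_max: float) -> Box:
    """Convert corner coordinates to the x/y/width/height convention."""
    return Box(x_min, y_min, x_max - x_min, y_max - y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes: 1.0 is identical, 0.0 is disjoint."""
    overlap_w = max(0.0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    overlap_h = max(0.0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    intersection = overlap_w * overlap_h
    union = a.width * a.height + b.width * b.height - intersection
    return intersection / union if union > 0 else 0.0


box_a = from_corners(0, 0, 10, 10)
box_b = Box(5, 0, 10, 10)
print(f"IoU: {iou(box_a, box_b):.3f}")
```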
We evaluated both systems on a handwritten poem with cursive and stylized text.
Response - Gemini OCR:
Processed in 40 seconds
Response - JigsawStack vOCR:
Processed in 40 seconds
Analysis - Gemini OCR:
Processing Time: 42 seconds
Accuracy: Handles the handwritten words successfully, with occasional variations in challenging text portions
Output Quality: Provides basic bounding box coordinates
Contextual Limitations: Has difficulty resolving unclear handwriting into meaningful phrases, producing fragments such as "stuing," "mited steps," and "floy Alis"
Analysis - JigsawStack vOCR:
Processing Time: 32 seconds
Accuracy: Demonstrates good contextual understanding with intelligent interpretation of handwritten content
Output Quality: Delivers complete JSON with both raw text recognition and enhanced interpretation
Linguistic Intelligence: Reconstructs likely intended phrases like "How the soul fills with happiness" instead of "Hon the sl fills with happines"
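The gap between raw recognition and the interpreted phrase can be made concrete with a word-level diff. A small sketch using Python's standard `difflib`; the helper name `changed_words` is hypothetical, and the two strings are the ones quoted above:

```python
import difflib


def changed_words(raw: str, interpreted: str) -> list[tuple[str, str]]:
    """Return (raw, interpreted) pairs for each span of words that differs."""
    a, b = raw.split(), interpreted.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


raw = "Hon the sl fills with happines"
interpreted = "How the soul fills with happiness"
for before, after in changed_words(raw, interpreted):
    print(f"{before!r} -> {after!r}")
# Prints:
# 'Hon' -> 'How'
# 'sl' -> 'soul'
# 'happines' -> 'happiness'
```

Keeping both layers makes this kind of check possible: the raw layer shows what was actually read, and the diff shows exactly where the interpretation stepped in.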
We evaluated a 15-page PDF: https://arxiv.org/pdf/2406.04692
Response - Gemini OCR:
Processed in 43 seconds
Response - JigsawStack vOCR:
Processed in 37 seconds
Analysis - Gemini OCR:
Processing Time: 43 seconds
Accuracy: Incomplete extraction due to running out of tokens, processing only a fraction of the document
Output Quality: Limited to extracting metadata and first page elements before encountering errors
Coordinate Precision: High precision for the elements it processed, but not sustained across the whole document
Reliability: Encountered processing limitations leading to incomplete output
Analysis - JigsawStack vOCR:
Processing Time: 37 seconds
Accuracy: Comprehensive extraction of all 15 pages with complete contextual information
Output Quality: Well-structured JSON with hierarchical organization of document elements
Coordinate Precision: Detailed bounding box coordinates with width/height measurements for every text element
Reliability: Successfully processed over 350,000 tokens of content with no degradation
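One generic way to stay under an output-token limit on long documents is to send the pages in batches and merge the results afterwards. This is not something the benchmark did. It is a sketch of the idea, and `page_batches` and the batch size of 4 are illustrative assumptions:

```python
def page_batches(total_pages: int, batch_size: int) -> list[range]:
    """Split pages 1..total_pages into consecutive batches of at most batch_size."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        range(start, min(start + batch_size, total_pages + 1))
        for start in range(1, total_pages + 1, batch_size)
    ]


# A 15-page document in batches of 4 pages.
for batch in page_batches(15, 4):
    print(f"pages {batch.start}-{batch.stop - 1}")
# Prints:
# pages 1-4
# pages 5-8
# pages 9-12
# pages 13-15
```

The trade-off is more requests and some bookkeeping to stitch page-level results back into one document-level structure.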
Processing Efficiency: JigsawStack vOCR was faster in most of our tests while maintaining high-quality results; Gemini OCR was quicker on the English and Telugu sample (9 seconds against 14)
Structured Data Organization: JigsawStack provides comprehensive output structures with hierarchical formatting that makes information immediately actionable
Multilingual Capabilities: JigsawStack shows particular strength in handling non-Latin scripts like Japanese and Telugu with proper Unicode encoding
Contextual Understanding: JigsawStack offers dual-layer recognition that provides both raw text and enhanced interpretations for challenging content
Document Intelligence: JigsawStack includes valuable metadata about document formatting, language detection, and visual presentation
Our benchmarking shows that both systems offer effective OCR capabilities with different strengths. Gemini OCR provides good basic text recognition with solid performance for straightforward content. JigsawStack vOCR delivers enhanced functionality through its structured output formats, superior multilingual support, and comprehensive document analysis.
Run these tests yourself here: Google Colab Notebook.
Basic Text Recognition: Both systems perform effectively for simple English text extraction
Detailed Document Analysis: JigsawStack vOCR offers more comprehensive structured data with spatial positioning and formatting details
Multilingual Processing: JigsawStack demonstrates notable advantages for non-Latin scripts and complex language handling
Time-Sensitive Applications: JigsawStack's generally faster processing times provide efficiency benefits for high-volume document processing
Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!