Gemini OCR vs JigsawStack vOCR

Following in the footsteps of our earlier Mistral OCR benchmark, we now put Google's Gemini OCR to the test against JigsawStack vOCR. Both technologies claim robust text extraction, so we ran real-world tests across a variety of document types and languages to see which performs better in practical applications.

What Makes a Great OCR Solution?

Before diving into the results, let's remember the key factors that differentiate exceptional OCR solutions from basic ones:

  • Multilingual text recognition capability

  • Ability to process both handwritten and printed text

  • Provision of precise bounding boxes for spatial positioning

  • Structured data extraction and formatting

  • Context understanding and intelligent interpretation

  • Consistency and accuracy across various document types

Ensuring a Fair Comparison

To compare the systems fairly, we provided Gemini OCR with a structured JSON prompt that closely mirrored JigsawStack vOCR’s native output format. This prompt instructed Gemini OCR to return:

  • Bounding boxes for words and lines

  • Structured text sections

  • Metadata, including document dimensions and detected tags

By enforcing this format, we ensured that both OCR solutions returned comparable outputs, making it easier to assess their real capabilities beyond differences in default response formatting.
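
One way to enforce such a format is to embed an example of the target structure directly in the prompt. The sketch below is an illustration only: the helper name `build_structured_prompt`, the instruction wording, and the trimmed template are assumptions, not the exact prompt used in this benchmark.

```python
import json

# Trimmed, hypothetical template that mirrors the fields listed above.
TEMPLATE = {
    "success": True,
    "width": 0,
    "height": 0,
    "tags": [],
    "has_text": True,
    "sections": [
        {
            "text": "",
            "lines": [
                {
                    "text": "",
                    "bounds": {
                        "top_left": {"x": 0, "y": 0},
                        "bottom_right": {"x": 0, "y": 0},
                    },
                    "words": [],
                }
            ],
        }
    ],
}


def build_structured_prompt(task: str, template: dict = TEMPLATE) -> str:
    """Combine the task with an explicit description of the expected JSON shape."""
    return (
        f"{task}\n\n"
        "Return only JSON that follows this structure, with pixel coordinates "
        "for every line and word:\n"
        f"{json.dumps(template, indent=2)}"
    )


prompt = build_structured_prompt("Extract all text from this image.")
```

Keeping the template in one place means the same structure can be reused both for prompting and for checking what comes back.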

Summary Comparison:

Standard Outputs

| Feature | Gemini OCR (Standard) | JigsawStack vOCR |
|---|---|---|
| Multilingual support | Moderate support, struggles with certain Asian languages | Strong support across 70+ languages including less common ones ✅ |
| Handwriting recognition | Basic recognition with moderate accuracy | Superior accuracy with context preservation ✅ |
| Bounding boxes | Not provided in standard output | Comprehensive word and line-level positioning data ✅ |
| Structured output | Good JSON formatting but requires post-processing | Native structured output with customizable fields ✅ |
| Processing speed | Fast processing (3-6 seconds) ✅ | Slightly slower (5-9 seconds) |
| Context understanding | Good basic extraction but lacks semantic understanding | Better preservation of document context and relationships ✅ |

Gemini Structured Output Request vs. JigsawStack

| Feature | Gemini OCR (Structured Output) | JigsawStack vOCR |
|---|---|---|
| Multilingual support | Moderate support but significantly slower processing | Strong support across 70+ languages with consistent performance ✅ |
| Handwriting recognition | Significantly degraded accuracy with bounding box requests | Superior accuracy maintained ✅ |
| Bounding boxes | Incomplete or inaccurate positioning data with high latency | Comprehensive word and line-level positioning data with reasonable speed ✅ |
| Structured output | Struggles with combined spatial and semantic data | Native structured output with semantic understanding ✅ |
| Processing speed | Extremely slow (30-38 seconds) when spatial data requested | Consistent performance (5-9 seconds) ✅ |
| Context understanding | Further degraded when forced into structured format | Better preserves contextual relationships ✅ |

Benchmarking Methodology

We tested Gemini OCR and JigsawStack vOCR using four document types:

  1. Receipt Processing – Extracting totals, taxes, and itemized entries.

  2. Multilingual Recognition – Handling mixed-language street signs.

  3. Handwritten Text Recognition – Transcribing cursive and stylized handwriting.

  4. Structured Document Processing (PDFs) – Extracting tabular data and financial details.

Each OCR system received identical image inputs along with the corresponding prompt. We also structured the Gemini OCR request so that its output matched JigsawStack vOCR’s native response format, using the following JSON template:

{
  "success": true,
  "context": {},
  "width": 1000,
  "height": 750,
  "tags": ["text", "document"],
  "has_text": true,
  "sections": [
    {
      "text": "Extracted text here",
      "lines": [
        {
          "text": "Line text",
          "bounds": {
            "top_left": { "x": 100, "y": 50 },
            "bottom_right": { "x": 300, "y": 70 }
          },
          "words": [
            {
              "text": "Word",
              "bounds": {
                "top_left": { "x": 110, "y": 55 },
                "bottom_right": { "x": 140, "y": 65 }
              }
            }
          ]
        }
      ]
    }
  ]
}
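
Because a model may ignore parts of a requested structure, it helps to check each response before comparing results. The following is a minimal sketch, assuming a response has already been parsed into a Python dict shaped like the template above; the function names `iter_bounds` and `missing_bounds` are hypothetical helpers, not part of either SDK.

```python
def iter_bounds(response: dict):
    """Yield (level, text, bounds) for every line and word in a response."""
    for section in response.get("sections", []):
        for line in section.get("lines", []):
            yield "line", line.get("text"), line.get("bounds")
            for word in line.get("words", []):
                yield "word", word.get("text"), word.get("bounds")


def missing_bounds(response: dict, required=("top_left", "bottom_right")):
    """Return the (level, text) pairs whose bounding box is absent or incomplete."""
    problems = []
    for level, text, bounds in iter_bounds(response):
        if not isinstance(bounds, dict) or any(key not in bounds for key in required):
            problems.append((level, text))
    return problems


# Small sample: the line has a full box, the word is missing one corner.
sample = {
    "sections": [
        {
            "text": "Line text",
            "lines": [
                {
                    "text": "Line text",
                    "bounds": {
                        "top_left": {"x": 100, "y": 50},
                        "bottom_right": {"x": 300, "y": 70},
                    },
                    "words": [
                        {"text": "Word", "bounds": {"top_left": {"x": 110, "y": 55}}}
                    ],
                }
            ],
        }
    ]
}

problems = missing_bounds(sample)
```

An empty list means every line and word carries the corners named in `required`; anything else points at the elements that lack spatial data.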

Benchmarking Setup

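import os
import json
import time
from concurrent.futures import ThreadPoolExecutor

# genai.Client and the typed helpers below come from the google-genai SDK
# (pip install google-genai), not the older google-generativeai package.
from google import genai
from google.genai.types import Content, GenerateContentConfig, Part
from jigsawstack import JigsawStack, JigsawStackError

# Assumption: API keys are supplied through environment variables.
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
JIGSAWSTACK_API_KEY = os.environ.get("JIGSAWSTACK_API_KEY")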

# Define test images and prompts
TEST_IMAGES = [
    "test_files/sample_receipt.jpg",
    "test_files/sample_handwriting.jpg",
    "test_files/sample_multilingual.jpg",
    "test_files/sample_pdf.pdf",
]

PROMPTS = {
    "test_files/sample_receipt.jpg": "Extract the total price, tax, and all itemized entries from this receipt.",
    "test_files/sample_handwriting.jpg": "Transcribe all handwritten text from this image, ensuring accuracy in cursive and print styles.",
    "test_files/sample_multilingual.jpg": "Extract all text from this image, identifying different languages and preserving formatting.",
    "test_files/sample_pdf.pdf": "Extract all structured text and maintain the document's section hierarchy from this PDF."
}

OUTPUT_FOLDER = "benchmark_results"
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

Gemini OCR Request Function

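def generate(
        file: str = "sample_receipt.jpg",
        model: str = "gemini-2.0-flash",
        # Immutable default; any sequence of prompt strings works here.
        prompt=("Extract the total price, tax, and all itemized entries from this receipt.",),
):
    """Uploads the file to Gemini and requests a JSON response for the given prompts."""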
    client = genai.Client(api_key=GEMINI_API_KEY)

    files = [
        client.files.upload(file=file),
    ]
    
    contents = [
        Content(
            role="user",
            parts=[
                Part.from_uri(
                    file_uri=files[0].uri,
                    mime_type=files[0].mime_type,
                ),
                Part.from_text(text="Based on the given file and its contents perform vOCR to obtain results for each of the following:\n" + "\n".join(prompt)),
            ],
        ),
    ]

    generate_content_config = GenerateContentConfig(
        temperature=0.6,
        top_p=0.95,
        top_k=40,
        max_output_tokens=8192,
        response_mime_type="application/json",
    )

    content = client.models.generate_content(
        model=model,
        contents=contents,
        config=generate_content_config
    )

    return content

Calling Gemini OCR

def call_gemini_ocr(image_path):
    """Runs OCR through the Gemini API (gemini-2.0-flash by default) and times the call"""
    if not os.path.exists(image_path):
        print(f"Error: File {image_path} not found.")
        return None, None

    try:
        # generate() uploads the file itself, so there is no need to read it here.
        prompt = PROMPTS.get(image_path, "Extract all text from this image.")

        start_time = time.perf_counter()
        response = generate(file=image_path, prompt=[prompt])
        latency = time.perf_counter() - start_time

        if not response or not hasattr(response, "text"):
            print(f"Warning: Gemini API returned unexpected response for {image_path}")
            return None, latency

        return response.text, latency
    except Exception as e:
        print(f"Gemini API Error: {e}")
        return None, None

Calling JigsawStack vOCR

def upload_to_jigsawstack(image_path):
    """Uploads file to JigsawStack File Storage and returns file_store_key"""
    if not os.path.exists(image_path):
        print(f"Error: File {image_path} not found.")
        return None

    jigsawstack = JigsawStack(api_key=JIGSAWSTACK_API_KEY)

    try:
        with open(image_path, "rb") as image_file:
            image_data = image_file.read()
            result = jigsawstack.store.upload(
                image_data, {"filename": os.path.basename(image_path), "overwrite": True}
            )
        result = result.json()
        file_key = result['key']

        if not file_key:
            print(f"Error: JigsawStack did not return a valid key for {image_path}")
            return None

        return file_key
    except JigsawStackError as err:
        print(f"Error uploading {image_path} to JigsawStack: {err}")
        return None
    except Exception as e:
        print(f"Unexpected error during JigsawStack upload: {e}")
        return None


def call_jigsawstack_vocr(image_path):
    """Calls JigsawStack vOCR using file_store_key and optimized prompts"""
    jigsawstack = JigsawStack(api_key=JIGSAWSTACK_API_KEY)

    file_store_key = upload_to_jigsawstack(image_path)
    if not file_store_key:
        print(f"Skipping JigsawStack vOCR for {image_path} due to upload failure")
        return None, None

    prompt = PROMPTS.get(image_path, "Describe the image in detail.")

    start_time = time.perf_counter()
    try:
        result = jigsawstack.vision.vocr({"file_store_key": file_store_key, "prompt": prompt})
        result = result.json()
        latency = time.perf_counter() - start_time
        return result, latency
    except JigsawStackError as err:
        print(f"Error processing {image_path} with vOCR: {err}")
        return str(err), None
    except Exception as e:
        print(f"Unexpected error during JigsawStack vOCR: {e}")
        return str(e), None

Processing Images in Parallel

def process_image(image):
    """Runs OCR on a single image using both APIs"""
    print(f"Processing {image}...")

    gemini_result, gemini_latency = call_gemini_ocr(image)
    jigsawstack_result, jigsawstack_latency = call_jigsawstack_vocr(image)

    save_results(image, gemini_result, gemini_latency, jigsawstack_result, jigsawstack_latency)

def run_benchmark():
    """Runs benchmark tests in parallel"""
    try:
        with ThreadPoolExecutor() as executor:
            # Consume the iterator so exceptions raised in worker threads surface here.
            list(executor.map(process_image, TEST_IMAGES))
        print("Benchmarking complete. Results saved in", OUTPUT_FOLDER)
    except Exception as e:
        print(f"Error during benchmark execution: {e}")

if __name__ == "__main__":
    try:
        run_benchmark()
    except Exception as e:
        print(f"Fatal error in benchmark script: {e}")
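
The script calls `save_results`, which is not shown in the listing. A minimal sketch of what such a helper could look like follows; the file naming, the record layout, and the optional `output_folder` parameter are assumptions made for illustration, not the implementation used for this benchmark.

```python
import json
import os

OUTPUT_FOLDER = "benchmark_results"


def save_results(image, gemini_result, gemini_latency,
                 jigsawstack_result, jigsawstack_latency,
                 output_folder=OUTPUT_FOLDER):
    """Write both results and latencies for one input file to a JSON file."""
    os.makedirs(output_folder, exist_ok=True)
    stem = os.path.splitext(os.path.basename(image))[0]
    out_path = os.path.join(output_folder, f"{stem}_results.json")

    record = {
        "input": image,
        "gemini": {"latency_seconds": gemini_latency, "result": gemini_result},
        "jigsawstack": {
            "latency_seconds": jigsawstack_latency,
            "result": jigsawstack_result,
        },
    }

    # ensure_ascii=False keeps non-Latin text (e.g. Japanese) readable on disk;
    # default=str guards against values json cannot serialize directly.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2, default=str)
    return out_path
```

Writing one file per input keeps the parallel workers from contending over a shared output file.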

Test 1: Receipt Processing

We evaluated both systems on a standard Walmart receipt containing multiple line items, taxes, and totals.

Response - Gemini OCR

Gemini with Unstructured Output:

{
  "total": "144.02",
  "tax": "4.58",
  "items": [
    {
      "name": "TATER TOTS",
      "price": "2.96"
    },
    {
      "name": "HARD/PROV/DC",
      "price": "2.68"
    },
    {
      "name": "SNACK BARS",
      "price": "4.98"
    },
    // Additional items truncated for brevity
  ]
}

Processed in 6.43 seconds with no spatial data

Gemini With Structured Output and Detecting Bounding Boxes:

{
  "total": "144.02",
  "tax": "4.58",
  "items": [
    {
      "name": "TATER TOTS",
      "price": "2.96"
    },
    {
      "name": "HARD/PROV/DC",
      "price": "2.68"
    },
    {
      "name": "SNACK BARS",
      "price": "4.98"
    },
    // Additional items truncated for brevity
  ]
}

Processed in 6.15 seconds with no spatial data

Response - JigsawStack vOCR

{
  "success": true,
  "context": "Here are the details extracted from the receipt...",
  "sections": [
    {
      "text": "See back of receipt for your chance\n
               to win $1000 ID #: 7N5N1VIXCQDQ\n
               Walmart\n 317-851-1102 Mgr:JAMIE BROOKSHIRE\n
               882 S. STATE ROAD 135\nGREENWOOD IN 46143\n
               ST# 05483 OP# 001436 TE# 09 TR# 06976\n
               TATER TOTS\n001312000026 F 2.96 0\n...",
      "lines": [
        {
          "text": "See back of receipt for your chance",
          "bounds": {
            "top_left": { "x": 185, "y": 63 },
            "top_right": { "x": 459, "y": 76 },
            "bottom_right": { "x": 459, "y": 93 },
            "bottom_left": { "x": 184, "y": 84 },
            "width": 274.5, "height": 19
          },
          "words": [
            {
              "text": "See",
              "bounds": {
                // Bounding box data for each word
              }
            },
            // Additional words truncated
          ]
        },
        // Additional lines truncated
      ]
    }
  ],
  // Additional data truncated
}

Processed in 9.08 seconds with comprehensive positioning data

Gemini OCR Performance (Standard Output)

  • Accuracy: Good extraction of receipt data (total, tax, items)

  • Processing Time: 6.43 seconds

  • Output Quality: Clean JSON with essential receipt information

Gemini OCR Performance (With Structured Output Request)

  • Accuracy: Similar extraction quality, but still no positioning data

  • Processing Time: 6.15 seconds

  • Output Quality: Failed to provide spatial data despite explicit request

JigsawStack vOCR Performance

  • Accuracy: Comprehensive text extraction from the receipt

  • Processing Time: 9.08 seconds

  • Output Quality: Complete text capture with detailed positioning information

  • Organization: Includes precise position data for each text element

Analysis: While Gemini OCR processed the receipt faster, JigsawStack vOCR delivered substantially more detail, including the exact position of each text element, which is crucial for applications that require spatial understanding of the document.
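
To show how the four-corner positioning data can be used, the sketch below derives a box size from the corner points of the first receipt line in the JigsawStack response above. Averaging opposite edges reproduces the reported `width` and `height` for this sample; that rule is an inference from this one example, not documented behavior, and `box_size` is a hypothetical helper name.

```python
def box_size(bounds: dict) -> tuple:
    """Estimate (width, height) of a possibly skewed box by averaging opposite edges."""
    top = bounds["top_right"]["x"] - bounds["top_left"]["x"]
    bottom = bounds["bottom_right"]["x"] - bounds["bottom_left"]["x"]
    left = bounds["bottom_left"]["y"] - bounds["top_left"]["y"]
    right = bounds["bottom_right"]["y"] - bounds["top_right"]["y"]
    return (top + bottom) / 2, (left + right) / 2


# Corner points copied from the "See back of receipt for your chance" line.
line_bounds = {
    "top_left": {"x": 185, "y": 63},
    "top_right": {"x": 459, "y": 76},
    "bottom_right": {"x": 459, "y": 93},
    "bottom_left": {"x": 184, "y": 84},
}

width, height = box_size(line_bounds)
print(width, height)  # 274.5 19.0
```

Having all four corners, rather than only two, is what makes it possible to handle text that is slightly rotated, as on this photographed receipt.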

Test 2: Multilingual Text Recognition

We evaluated a multilingual street sign containing Japanese characters and directional information.

Response - Gemini OCR

Gemini with Unstructured Output:

[
  "a",
  "a",
  "a",
  "a",
  "0.2 Km",
  "alamy",
  "四天王寺",
  "alamy",
  "a",
  "庚申堂>",
  "a",
  "0.1km",
  "a",
  "竹本義太夫墓",
  "●(超願寺内)。すぐ。",
  "a",
  "alamy",
  "Image ID: CBDNR6",
  "www.alamy.com",
  "a",
  "a",
  "a"
]

Processed in 3.02 seconds - limited structure and context

Gemini With Structured Output and Detecting Bounding Boxes:

{
  "success": true,
  "context": {},
  "width": 900,
  "height": 1200,
  "tags": [],
  "has_text": true,
  "sections": [
    {
      "text": "a\n
               a\n
               0.2 Km\n
               alamy\n
               四天王寺\n
               alamy\n
               a\n庚申堂>\n
               a\n0.1km\n
               a\n竹本義太夫墓\n
               ●(超願寺内)。すぐ。\n
               a\nalamy\n
               Image ID: CBDNR6\nwww.alamy.com",
      "lines": [
        {
          "text": "a",
          "bounds": {
            "top_left": { "x": 19, "y": 19 },
            "top_right": { "x": 34, "y": 19 },
            "bottom_right": { "x": 34, "y": 29 },
            "bottom_left": { "x": 19, "y": 29 },
            "width": 15,
            "height": 10
          },
          "words": [
            {
              "text": "a",
              "bounds": {
                "top_left": { "x": 19, "y": 19 },
                "top_right": { "x": 34, "y": 19 },
                "bottom_right": { "x": 34, "y": 29 },
                "bottom_left": { "x": 19, "y": 29 },
                "width": 15,
                "height": 10
              }
            }
          ]
        },
        
        //Lines omitted for brevity
      ]
    }
  ]
}

Processed in 30.67 seconds with spatial data

Response - JigsawStack vOCR

{
  "success": true,
  "context": "I'm unable to extract text from the image directly. However, I can help with general information or answer questions you might have!",
  "width": 1300,
  "height": 951,
  "tags": [
    "text", "screenshot", "rectangle", "font", "line",
    "number", "signage", "colorfulness"
  ],
  "has_text": true,
  "sections": [
    {
      "text": "a\n四天王寺\n
               a a a\n0.2Km\n
               alamy\nalamy\n
               a a\na\n
               0. 1km\n
               庚申心\n
               02 a\n
               alamy alamy\n
               竹本義太夫墓\n
               a\n
               (超願寺内)。すぐ。\n
               a\n
               alamy\n
               Image ID: CBDNR6\nwww.alamy.com",
      "lines": [
        {
          "text": "a",
          "bounds": {
            "top_left": { "x": 1089, "y": 24 },
            "top_right": { "x": 1106, "y": 24 },
            "bottom_right": { "x": 1106, "y": 49 },
            "bottom_left": { "x": 1089, "y": 49 },
            "width": 17,
            "height": 25
          }
        },
        // Detailed line data with position information
      ]
    }
  ]
}

Processed in 6.98 seconds

Gemini OCR Performance (Standard Output)

  • Accuracy: Successfully captured Japanese characters

  • Processing Time: 3.02 seconds

  • Output Quality: Limited structure and contextual information

Gemini OCR Performance (With Structured Output Request)

  • Accuracy: Successfully captured Japanese characters

  • Processing Time: 30.67 seconds

  • Output Quality: Attempted to provide spatial data but with minimal context interpretation

JigsawStack vOCR Performance

  • Accuracy: Successfully recognized Japanese characters

  • Processing Time: 6.98 seconds (4.4× faster than Gemini's structured output)

  • Output Quality: Returned structured data with bounding boxes, although the context field in this run held a generic refusal rather than a summary of the sign

  • Organization: Included useful image tags (text, screenshot, rectangle, font, etc.)

Analysis: JigsawStack vOCR delivered comparable accuracy while processing the multilingual content more than four times faster than Gemini OCR's structured output (6.98 vs. 30.67 seconds), and it also returned descriptive image tags.

Test 3: Handwritten Text Recognition

We evaluated both systems on a handwritten poem with cursive and stylized text.

Response - Gemini OCR (Simplified):

Gemini with Unstructured Output:

[
  "The lovely Seng night may soup hing shineg",
  "Wensome and faranell my heart was beating",
  "th rosehush on fre moor the violet beautiful",
  "The artists, evening song new love new hiff",
  "To behinja Holde bili marst se lang Inerell farewell",
  "Non I leave this litle hunt where my beloved live",
  "Walking now with wiled steps through the lenses",
  "Luna shines throught busk and oak zephar per path",
  "And the bich trees bowing how shed incense on the trade",
  "How beautiful the coolness of this lovely summer night!",
  "Hon the asl fills with happines in this tul place of quiet!",
  "I can scarcely gross the bliss, jot Heaven I would shan",
  "A thousand nights like this if my darling granted one."
]

Processed in 3.79 seconds - numerous transcription errors

Gemini With Structured Output and Detecting Bounding Boxes:

{ 
    //bounding boxes 
    "\"text\": \"The loure\\nWensome and faranell my heart\",\n
    \"lines\": [\n {\n \"text\": \"The loure\",\n
    \"bounds\": {\n \"top_left\": {\n \"x\": 58,\n \"y\": 48\n},\n            
    \"top_right\": {\n \"x\": 147,\n\"y\": 48\n },\n            
    \"bottom_right\": {\n \"x\": 147,\n\"y\": 73\n },\n            
    \"bottom_left\": {\n  \"x\": 58,\n \"y\": 73\n},\n            
    \"width\": 89,\n \"height\": 25\n}"
    "text\": \"th rosehush on fre the violet beautiful\\n
    My\\nThe artists, evening song\\nnew life\\n"
    

    //text output
    "To behinja Holde bili marst se lang Inerell farewell\\n
    Non I leave this litle hunt where my beloved live\\n
    Walking now with wiled steps through the lenses\\n
    Luna shines throught busk and oak zephar per path\\n
    And the bich trees bowing how shed incense on the trade\\n
    How beautiful the coolness of this lovely summer night!\\n
    Hon the asl fills with happines in this tul place of quiet!\\n
    I can scarcely scarcely gross the bliss, jot Heaven I would shan\\n
    A thousand nights like this if my darling granted one."
   
}

Processed in 37.55 seconds - numerous transcription errors

Response - JigsawStack vOCR

{
  "success": true,
  "context": "The lovely Spring night may come when she shines\n
              Welcome and farewell my heart was beating\n
              the rosebank on the river the violet beautiful\n
              The nights evening song we love new life\n
              to be alive this must we leave farewell\n
              Now I have this little hut where I heard him\n
              Walking now with naked steps through the doors\n
              when shines moonlight husk and oak zephyr perfake\n
              And the nice trees towering overhead incense on the road\n
              How beautiful the coolness of this lovely summer night!\n
              Even the old fills with happiness in this true place of quiet!\n
              I can scarcely grasp the bliss, yet Heaven, I would share\n
              A thousand nights like this if my darling granted one.",
  "width": 459,
  "height": 360,
  "tags": [
    "text", "handwriting", "letter", "calligraphy",
    "paper", "document", "font"
  ],
  "has_text": true,
  "sections": [
    {
      "text": "The lorey Seng night may comp ling stuing...",
      "lines": [
        // Detailed line-by-line data with bounding boxes
      ]
    }
  ]
}

Processed in 7.16 seconds - better contextual understanding and structure preservation

Gemini OCR Performance (Standard Output)

  • Accuracy: Captured handwritten text with numerous transcription errors

  • Processing Time: 3.79 seconds

  • Output Quality: Basic text extraction without spatial context

Gemini OCR Performance (With Structured Output Request)

  • Accuracy: Transcription errors similar to those in the standard output

  • Processing Time: 37.55 seconds

  • Output Quality: Attempted to provide bounding boxes but with incomplete content

JigsawStack vOCR Performance

  • Accuracy: Better contextual understanding of handwritten content

  • Processing Time: 7.16 seconds (5× faster than Gemini's structured output)

  • Output Quality: Provided both raw text and a human-readable interpretation

  • Organization: Better preserved the meaning of the handwritten content

Analysis: JigsawStack vOCR demonstrated significantly better performance with improved accuracy, processing handwritten content about five times faster than Gemini OCR's structured output while delivering more contextually meaningful results.

Test 4: Structured Document (PDF) Processing

We evaluated an invoice PDF with tabular data, company information, and financial details.

Response - Gemini OCR

Gemini without prompt:

{
  "invoice_number": "3299",
  "invoice_date": "May 6, 2024",
  "due_date": "May 17, 2024",
  "po_number": "15",
  "billing_address": "Futurelink Solutions\nKlausdalsbrovej 601\nBallerup 2750\nDenmark",
  "company_address": "Sampleroad 14\nPostal 1410\nDenmark",
  "company_name": "Demo Business Partner",
  "balance_due": "€2,841.44",
  "line_items": [
    {
      "item": "47500177- Ø0.6mm Drill Guide",
      "quantity": "50",
      "rate": "€2.50",
      "amount": "€125.00"
    },
    // Additional line items truncated
  ],
  "subtotal": "€2,257.15",
  "tax": "€564.29",
  "shipping": "€20.00",
  "total": "€2,841.44",
  "notes": "Please pay in due time",
  "terms": "Terms of payment: Netto 10 days\nPlease transfer amount to account: Reg.nr. 1234 Konto nr. 0123456789\nWhen paying by bank transfer, please state invoice no."
}

Processed in 5.34 seconds - clean structured data

Gemini With Structured Output and Detecting Bounding Boxes:

{
  "success": true,
  "context": {},
  "width": 792,
  "height": 1122,
  "tags": [],
  "has_text": true,
  "sections": [
    {
      "text": "Demo Business Partner\nSampleroad 14\nPostal 1410\nDenmark",
      "lines": [
        {
          "text": "Demo Business Partner",
          "bounds": {
            "top_left": {"x": 130, "y": 24},
            "top_right": {"x": 268, "y": 24},
            "bottom_right": {"x": 268, "y": 37},
            "bottom_left": {"x": 130, "y": 37},
            "width": 137.859,
            "height": 13.11
          },
          "words": [
            {
              "text": "Demo",
              "bounds": {
                "top_left": {"x": 130, "y": 24},
                "top_right": {"x": 166, "y": 24},
                "bottom_right": {"x": 166, "y": 37},
                "bottom_left": {"x": 130, "y": 37},
                "width": 35.922,
                "height": 13.11
              }
            }
            // Remaining words and lines omitted for brevity
          ]
        }
      ]
    }
  ]
}

Processed in 38.17 seconds

Response - JigsawStack vOCR:

{
  "success": true,
  "context": "```json\n{\n  \"INVOICE\": {\n    \"#\": \"3299\",\n    
             \"Date\": \"May 6, 2024\",\n    \"Due Date\": \"May 17, 2024\",\n    
             \"PO Number\": \"15\",\n    \"Balance Due\": \"€2,841.44\"\n  },\n  
             \"Bill To\": {\n    \"Company\": \"Futurelink Solutions\",\n    
             \"Address\": \"Klausdalsbrovej 601\\nBallerup 2750\\nDenmark\"\n  },\n  
             \"Items\": [\n    {\n      \"Item\": \"47500177 - Ø0.6mm Drill Guide\",\n      
             \"Quantity\": \"50\",\n      \"Rate\": \"€2.50\",\n      
             \"Amount\": \"€125.00\"\n    },\n    // Additional items truncated\n  ],\n  
             \"Totals\": {\n    \"Subtotal\": \"€2,257.15\",\n    
             \"Tax (25%)\": \"€564.29\",\n    \"Shipping\": \"€20.00\",\n    
             \"Total\": \"€2,841.44\"\n  },\n  \"Notes\": [\n    
             \"Please pay in due time\"\n  ],\n  \"Terms\": [\n    
             \"Terms of payment: Netto 10 days\",\n    
             \"Please transfer amount to account: Reg.nr. 1234 Konto nr. 0123456789\",\n    
             \"When paying by bank transfer, please state invoice no.\"\n  ]\n}\n```",
  "total_pages": 1,
  "width": 612,
  "height": 792,
  "tags": [
    "text", "screenshot", "document", "font"
  ],
  "has_text": true,
  "sections": [
    {
      "text": "Demo Business Partner\nINVOICE\nSampleroad 14\n...",
      "lines": [
        // Detailed line data with positioning information
      ]
    }
  ]
}

Processed in 6.39 seconds - structured with additional document metadata

Gemini OCR Performance (Standard Output)

  • Accuracy: Good extraction of the invoice text

  • Processing Time: 5.34 seconds

  • Output Quality: Clean structured data focused on business-relevant fields

Gemini OCR Performance (With Structured Output Request)

  • Accuracy: Poor extraction when prompted for structured data with bounding boxes

  • Processing Time: 38.17 seconds

  • Output Quality: Only produced a few lines of data with coordinates

JigsawStack vOCR Performance

  • Accuracy: Excellent extraction with business context

  • Processing Time: 6.39 seconds (6× faster than Gemini's structured output)

  • Output Quality: Pre-structured JSON representation with items, totals, and metadata already parsed

  • Organization: Data organized hierarchically with clear section demarcation

Analysis: JigsawStack vOCR excelled with dramatically better speed and more useful structured output when compared to Gemini OCR's attempt at producing structured data with bounding boxes.
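
In the response above, the `context` field holds the parsed invoice as a JSON string wrapped in a Markdown code fence. A minimal sketch of unwrapping it follows, assuming the field either contains a single fenced JSON block or plain JSON; `parse_fenced_json` is a hypothetical helper and the sample string is a shortened stand-in for the real field.

```python
import json
import re

# Matches an optional ```json ... ``` wrapper around the payload.
FENCE_RE = re.compile(r"^```(?:json)?\s*\n(.*?)\n```\s*$", re.DOTALL)


def parse_fenced_json(context: str):
    """Strip a surrounding Markdown code fence, if present, and parse the JSON inside."""
    stripped = context.strip()
    match = FENCE_RE.match(stripped)
    payload = match.group(1) if match else stripped
    return json.loads(payload)


sample_context = (
    "```json\n"
    "{\n"
    '  "INVOICE": {"#": "3299", "Balance Due": "€2,841.44"},\n'
    '  "Totals": {"Subtotal": "€2,257.15", "Tax (25%)": "€564.29"}\n'
    "}\n"
    "```"
)

invoice = parse_fenced_json(sample_context)
print(invoice["INVOICE"]["Balance Due"])  # €2,841.44
```

`json.loads` raises `json.JSONDecodeError` on malformed input, so callers that process untrusted responses may want to catch it.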

Key Findings

Without Structured Data Requirements

  • Speed Advantage: Gemini OCR consistently processed documents faster but provided less detailed output

  • Positioning Information: JigsawStack vOCR's inclusion of comprehensive bounding box data represents a significant advantage for applications requiring spatial understanding

With Structured Data Requirements

  • Performance Balance: JigsawStack vOCR offered better overall performance, running roughly four to six times faster than Gemini OCR's structured output in Tests 2 to 4 while returning more useful output structures

  • Specialized Use Cases: Gemini OCR performed well for basic receipt processing but struggled with complex documents requiring spatial information
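
The speed ratios quoted in the individual tests can be reproduced from the reported timings. The short calculation below uses only the processing times listed in this article; the dictionary layout and key names are ours.

```python
# Processing times in seconds, as reported in Tests 1-4.
TIMINGS = {
    "receipt": {"gemini_structured": 6.15, "jigsawstack": 9.08},
    "multilingual": {"gemini_structured": 30.67, "jigsawstack": 6.98},
    "handwriting": {"gemini_structured": 37.55, "jigsawstack": 7.16},
    "pdf": {"gemini_structured": 38.17, "jigsawstack": 6.39},
}

# Ratio > 1 means JigsawStack vOCR was faster than Gemini's structured output.
speedups = {
    name: round(t["gemini_structured"] / t["jigsawstack"], 1)
    for name, t in TIMINGS.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio}x")
# receipt: 0.7x
# multilingual: 4.4x
# handwriting: 5.2x
# pdf: 6.0x
```

The receipt is the exception: Gemini returned no spatial data there, so its structured request took about as long as its standard one and finished ahead of JigsawStack vOCR.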

Overall Assessment

  • Handwriting Recognition: JigsawStack demonstrated superior capabilities with greater accuracy and context preservation

  • Structured Output: JigsawStack offered more flexibility in customizing extraction fields and maintaining document relationships

  • Multilingual Support: JigsawStack appeared to have broader language support based on documentation and testing

Conclusion

Our benchmarking reveals that while Gemini OCR offers impressive speed for basic extraction, JigsawStack vOCR provides a more comprehensive solution with superior positional data, handwriting recognition, and structural understanding. For applications requiring detailed document analysis rather than basic text extraction, JigsawStack vOCR demonstrates clear advantages.

The choice between these solutions ultimately depends on specific use case requirements:

  • If processing speed for simple text extraction is paramount, Gemini OCR may be preferable

  • If spatial understanding, handwriting recognition, or detailed document structure analysis is needed, JigsawStack vOCR offers superior capabilities

Getting Started with JigsawStack vOCR

JavaScript

import { JigsawStack } from "jigsawstack";

const jigsawstack = JigsawStack({
  apiKey: "your-api-key",
});

const result = await jigsawstack.vision.vocr({
  prompt: ["total_price", "tax"],
  url: "https://example.com/receipt.jpg",
});

console.log(result);
// Output example:
// {
//   success: true,
//   total_price: "144.02",
//   tax: "4.58",
//   // Additional context and metadata...
// }

Python

from jigsawstack import JigsawStack

jigsawstack = JigsawStack(api_key="your-api-key")

result = jigsawstack.vision.vocr({
  "url": "https://example.com/receipt.jpg", 
  "prompt": ["total_price", "tax"]
})

print(result)
# Similar structured output as JavaScript example

👥 Join the JigsawStack Community

Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!
