Beta

Solving In-Scene Text Translation: What Works, What Doesn’t, and What’s Next

Translating text in images is a challenging but necessary task for global communication. With our Image-to-Image Translation Model, we aim to develop a seamless solution that can extract, translate, and overlay text back into images while preserving their original style. Here’s a look at our progress so far, where we’re headed, and where we’d love your input.

Version 1: Core Pipeline

Our first iteration focuses on a four-step process to ensure high-quality text translation and integration:

OCR (Optical Character Recognition) – Extract text and bounding boxes from an image using PaddleOCR.
Define Target Text – Translate extracted text using our translation API.
Inpaint – Erase detected text areas, creating a clean canvas for new text placement.
Overlay – Render the translated text with a similar font in the erased areas to maintain visual consistency.

Image by AnyTrans

Current Progress

We have successfully implemented OCR extraction and text inpainting to remove existing text. The major challenge now lies in defining the geometry for text erasure and rendering—especially for complex cases like curved or stylized text. We are actively iterating on this problem to ensure smooth and natural-looking translations.

How should we handle cases where text follows a non-linear path? Can we ensure new text blends naturally into different textures? These are some of the open questions we’re still working through.

Version 2: Can a Diffusion Model Solve This?

While our first version lays the groundwork, our long-term goal is to train a diffusion model capable of generating high-quality translated text in any language and script. This will allow for more accurate text placement and rendering, especially for non-Latin languages like Hindi, Arabic, and Chinese.

We're building on insights from prior open-source research, drawing particular inspiration from AnyText by Yuxiang Tuo et al. This paper introduces a framework that combines large language models (LLMs) with text-guided diffusion models. By leveraging this approach, we enable context-aware translation and seamless text integration, ensuring that the final output maintains the image’s original aesthetic.

Although prior works like DiffUTE, AnyText, STEFFAN, DiffSTE, Stable-Flow, and Translatron-V have tackled in-scene text editing and machine translation, gaps still remain. Through our own experimentation and analysis of their stated limitations, we've identified key areas for improvement.

Currently, text rendering in diffusion models relies on ControlNet, a deep neural network that ensures text closely follows the original geometry by adding and removing noise recursively while applying style transfer. However, rendering non-Latin glyphs (e.g., Hindi) remains a significant challenge. Research into methods like GlyphControl and Conditional Control has shown promise in making visual text generation font-agnostic.

With our V2, we aim to address these limitations, enabling seamless in-scene machine translation using inpainting and fill techniques.

What’s Next?

Finalizing the V1 pipeline: Refining how text erasure and rendering work for different text orientations.
Exploring diffusion-based text generation: Training a model that can handle diverse scripts and writing styles.
Gathering community feedback: We know we’re not the only ones thinking about this, and we’d love your input!

What do you think? Have you worked with in-scene text rendering or multilingual text generation? Got ideas for better overlay techniques? Join the discussion on Discord or X – we’d love to hear your thoughts!