Translating text in images is a challenging but necessary task for global communication. With our Image-to-Image Translation Model, we aim to develop a seamless solution that can extract, translate, and overlay text back into images while preserving their original style. Here’s a look at our progress so far, where we’re headed, and where we’d love your input.
Our first iteration focuses on a four-step process to ensure high-quality text translation and integration:

Image by AnyTrans
We have successfully implemented OCR extraction and text inpainting to remove existing text. The major challenge now lies in defining the geometry for text erasure and rendering—especially for complex cases like curved or stylized text. We are actively iterating on this problem to ensure smooth and natural-looking translations.
How should we handle cases where text follows a non-linear path? Can we ensure new text blends naturally into different textures? These are some of the open questions we’re still working through.
While our first version lays the groundwork, our long-term goal is to train a diffusion model capable of generating high-quality translated text in any language and script. This will allow for more accurate text placement and rendering, especially for non-Latin languages like Hindi, Arabic, and Chinese.
We're building on insights from prior open-source research, drawing particular inspiration from AnyText by Yuxiang Tuo et al. This paper introduces a framework that combines large language models (LLMs) with text-guided diffusion models. By leveraging this approach, we enable context-aware translation and seamless text integration, ensuring that the final output maintains the image’s original aesthetic.
Although prior works like DiffUTE, AnyText, STEFFAN, DiffSTE, Stable-Flow, and Translatron-V have tackled in-scene text editing and machine translation, gaps still remain. Through our own experimentation and analysis of their stated limitations, we've identified key areas for improvement.
Currently, text rendering in diffusion models relies on ControlNet, a deep neural network that ensures text closely follows the original geometry by adding and removing noise recursively while applying style transfer. However, rendering non-Latin glyphs (e.g., Hindi) remains a significant challenge. Research into methods like GlyphControl and Conditional Control has shown promise in making visual text generation font-agnostic.
With our V2, we aim to address these limitations, enabling seamless in-scene machine translation using inpainting and fill techniques.
What do you think? Have you worked with in-scene text rendering or multilingual text generation? Got ideas for better overlay techniques? Join the discussion on Discord or X – we’d love to hear your thoughts!