Workflow Latency: Passing OCR Text to Gemini Reasoning Block without Visual Re-processing

I am building a four-stage medical verification “Vision Agent” in Workflows. My pipeline currently looks like this:

  1. Object Detection (RF-DETR): Locates the medical device screen.

  2. Dynamic Crop: Zooms in on the screen for better resolution.

  3. OCR (Gemini 2.5 Flash): Extracts the raw text/numerical values from the crop.

  4. Structured Reasoning (Gemini 2.5 Flash): Performs a medical logic check or checks for faulty equipment (e.g., “Is 180% SpO2 physically possible?”).
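For reference, the kind of logic check I want step 4 to perform is essentially this (an illustrative Python sketch; the field names and plausible ranges are my own examples, not my actual workflow code):

```python
# Illustrative sanity check for step 4; field names and ranges are
# example assumptions, not actual workflow code.

# Physically plausible ranges for common readings (assumed values).
PLAUSIBLE_RANGES = {
    "spo2": (0, 100),         # oxygen saturation, percent
    "heart_rate": (20, 300),  # beats per minute
}

def flag_impossible_readings(readings: dict) -> list:
    """Return (name, value) pairs that fall outside plausible ranges."""
    flagged = []
    for name, value in readings.items():
        lo, hi = PLAUSIBLE_RANGES.get(name, (float("-inf"), float("inf")))
        if not (lo <= value <= hi):
            flagged.append((name, value))
    return flagged

# An SpO2 of 180% is physically impossible and gets flagged:
print(flag_impossible_readings({"spo2": 180, "heart_rate": 72}))
# → [('spo2', 180)]
```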

The problem is that in the Structured Reasoning block (step 4 above), I want to use the text output from the OCR block (step 3) as the primary input. However, the Gemini block seems to require an image input to run, forcing the model to re-analyze the pixels of an image it just processed in the previous step. This creates redundant inference and significant latency.

Is there a way to pass the string output from the OCR block into the next Gemini block’s prompt without re-attaching the image? And if an image input is mandatory for the block to run, is there a way to optimize the process so the model doesn’t run full vision inference twice in a row?

Project Type: Object Detection
Operating System & Browser: Windows / Google Chrome
Workflow ID: Workflow
Do you grant Roboflow Support permission to access your Workspace for troubleshooting? (Yes/No): Yes

Hi there,

Your workflow setup looks solid, but you’re running into a common optimization challenge with multi-stage LLM processing.

Unfortunately, the Gemini reasoning block in Workflows currently requires an image input to function, which means you can’t pass just the OCR text output without also providing an image.

However, there are a couple of approaches you can try to reduce that redundant processing. First, you could pass a smaller, lower-resolution version of the cropped image to the reasoning block, since the reasoning is driven primarily by the OCR text anyway. The model still needs an image input to satisfy the block’s requirements, but a downscaled version can speed up the vision pass.
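Here is a minimal sketch of that downscaling step, assuming you shrink the crop with Pillow before it reaches the reasoning block (the function name and 256px cap are my own illustrative choices):

```python
from PIL import Image

def downscale_for_placeholder(crop: Image.Image, max_side: int = 256) -> Image.Image:
    """Shrink the crop so its longest side is at most max_side pixels,
    preserving aspect ratio. The reasoning block still receives an image,
    but vision processing on it is cheaper."""
    scale = max_side / max(crop.size)
    if scale >= 1.0:
        return crop  # already small enough
    new_size = (max(1, int(crop.width * scale)), max(1, int(crop.height * scale)))
    return crop.resize(new_size, Image.BILINEAR)

# Example with a dummy 1024x768 crop:
small = downscale_for_placeholder(Image.new("RGB", (1024, 768)))
print(small.size)  # → (256, 192)
```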

Another option is to restructure your workflow to combine steps 3 and 4 into a single Gemini block. Instead of doing separate OCR and reasoning calls, you could prompt the model to both extract the text AND perform the medical logic check in one pass. Your prompt might look something like: “Extract all text and numerical values from this medical device screen, then analyze if any readings are physically impossible (e.g., SpO2 over 100%). Return both the raw data and your analysis.”
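As a sketch of the combined approach, you could ask Gemini to return structured JSON and parse it downstream. The prompt wording and response schema below are my own assumptions, not a Roboflow or Gemini requirement:

```python
import json

# A combined OCR-plus-reasoning prompt (illustrative wording; the JSON
# schema is an assumption, not a required format).
COMBINED_PROMPT = (
    "Extract all text and numerical values from this medical device screen, "
    "then analyze whether any readings are physically impossible "
    "(e.g., SpO2 over 100%). Respond with JSON in the form: "
    '{"readings": {"<name>": <value>}, "impossible": ["<name>", ...]}'
)

def parse_gemini_response(raw: str) -> dict:
    """Parse the model's JSON reply, stripping optional ```json fences
    that LLMs often wrap around structured output."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        if cleaned.startswith("json"):
            cleaned = cleaned[4:]
    return json.loads(cleaned)

# Example with a mocked model reply:
reply = '{"readings": {"spo2": 180}, "impossible": ["spo2"]}'
print(parse_gemini_response(reply))
# → {'readings': {'spo2': 180}, 'impossible': ['spo2']}
```

Asking for a fixed JSON shape like this makes the single combined block's output as easy to consume downstream as the separate OCR block's text was.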

This approach eliminates the duplicate vision processing entirely since you’re only making one call to Gemini. The downside is less modularity, but for latency-critical applications like medical verification, the performance gain is usually worth it.

If you need the modular approach for other reasons, let me know and we can explore other optimization strategies.

Best,

Bar Shimshon