Real teams. Real savings.
See how teams use Doubleword to run workloads that would be prohibitively expensive at real-time API rates.
119,000 Medical Images Annotated for $452.
Claude Sonnet 4.6 Would Have Cost $7,487.
How OpenMed used Doubleword to make frontier-model knowledge distillation viable at dataset scale — and what it means for anyone building with synthetic data.
94%
Saved vs Anthropic
119K medical images annotated with two frontier VLMs, cross-validated at 93% agreement, producing 110K training records — for $452.58 total.
119K
Images annotated
93%
Cross-validation agreement
110K
Training records
+15%
Exact match improvement
The Challenge
Medical VQA datasets are small (VQA-RAD has just 314 training samples), narrow in coverage, and often restrictively licensed. Frontier VLMs can produce clinical analyses but cost $10–$50 per 1,000 images at real-time rates. Small 2–3B models are deployable but lack medical knowledge. Knowledge distillation at 119K images would be prohibitively expensive.
The Solution
By routing the entire annotation pipeline through Doubleword's async inference API, OpenMed ran two full annotation passes and two cross-validation passes over 119,137 images using Qwen 3.5 (397B) and Kimi K2.5 (1T). The OpenAI-compatible API required no pipeline changes — only the endpoint changed.
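Because the API is OpenAI-compatible, swapping providers amounts to changing one base URL. Below is a minimal sketch of what that looks like, using only the standard library; the `BASE_URL` and model id are placeholders, not Doubleword's actual values (with the official `openai` SDK, the equivalent change is just passing a different `base_url` to the client constructor).

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # swap this line; nothing else in the pipeline changes
MODEL = "vlm-placeholder"                # placeholder model id, not a real Doubleword model name

def build_annotation_request(image_url: str, question: str) -> dict:
    # Standard chat-completions payload with an image part: the same shape
    # works against any OpenAI-compatible endpoint.
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

def prepare_post(payload: dict, api_key: str) -> urllib.request.Request:
    # Builds (but does not send) the HTTP request, so the endpoint swap is
    # visible in one place.
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

The point of the sketch: the annotation logic and payload format stay identical; only the endpoint (and credentials) differ between providers.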
"The Doubleword team worked with us on batch annotation at scale. Their API made it economically viable to run two full annotation passes plus two cross-validation passes over 119K images with frontier reasoning models."
— Maziyar Panahi, Founder of OpenMed
Cost Breakdown
| Model | Provider | Total Cost | vs Doubleword |
|---|---|---|---|
| Qwen3.5-397B + Kimi-K2.5 | Doubleword | $452.58 | — |
| Qwen3.5-397B + Kimi-K2.5 | Alibaba Cloud + Moonshot AI | $1,393.39 | 3.1× more |
| Gemini 3 Flash | Google | $1,486.00 | 3.3× more |
| GPT-5 | OpenAI | $4,909.00 | 10.8× more |
| Claude Sonnet 4.6 | Anthropic | $7,487.00 | 16.5× more |
The Result
110,741 validated medical VQA records, open-sourced in full: datasets, model adapters, and code. Fine-tuning three small model families (2–3B parameters) on the synthetic dataset improved benchmarks across every model and every task. Best result: +15.0% average exact match improvement on Qwen3.5-2B.
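The cross-validation step described above can be sketched as a simple agreement filter between two annotation passes: keep only the records where both models produce the same answer, and report the agreement rate. The whitespace-and-case normalisation rule here is an illustrative assumption, not OpenMed's actual matching criterion.

```python
def normalise(answer: str) -> str:
    # Assumed normalisation: lowercase and collapse whitespace before comparing.
    return " ".join(answer.lower().split())

def cross_validate(pass_a: list[str], pass_b: list[str]) -> tuple[list[int], float]:
    # Returns the indices where the two passes agree, plus the agreement rate.
    agree = [i for i, (a, b) in enumerate(zip(pass_a, pass_b))
             if normalise(a) == normalise(b)]
    return agree, len(agree) / len(pass_a)
```

For example, `cross_validate(["Pleural effusion", "no finding"], ["pleural  effusion", "Cardiomegaly"])` keeps index 0 and reports 50% agreement. At dataset scale, the same filter is what turns 119K raw annotations into the 110K validated records.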
Why It Matters
Synthetic data generation and large-scale annotation are among the most cost-sensitive workloads in AI. They are high-volume, non-time-sensitive, and directly constrained by inference budget. At real-time API prices, only well-funded labs can annotate at the scale needed to produce useful training data. Async inference removes that constraint — and makes the full pipeline reproducible by anyone.