Real teams. Real savings.
See how teams use Doubleword to run workloads that would be prohibitively expensive at real-time API rates.
119,000 Medical Images Annotated for $452.
Claude Sonnet 4.6 Would Have Cost $7,487.
How OpenMed used Doubleword to make frontier-model knowledge distillation viable at dataset scale — and what it means for anyone building with synthetic data.
94%
Saved vs Anthropic
119K medical images annotated with two frontier VLMs, cross-validated at 93% agreement, producing 110K training records — for $452.58 total.
119K
Images annotated
93%
Cross-validation agreement
110K
Training records
+15%
Exact match improvement
The Challenge
Medical VQA datasets are small (VQA-RAD has just 314 training samples), narrow in coverage, and often restrictively licensed. Frontier VLMs can produce clinical analyses but cost $10–$50 per 1,000 images at real-time rates. Small 2–3B models are deployable but lack medical knowledge. Knowledge distillation at 119K images would be prohibitively expensive.
The Solution
By routing the entire annotation pipeline through Doubleword's async inference API, OpenMed ran two full annotation passes and two cross-validation passes over 119,137 images using Qwen 3.5 (397B) and Kimi K2.5 (1T). The OpenAI-compatible API required no pipeline changes — only the endpoint changed.
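Because the API is OpenAI-compatible, swapping providers amounts to changing one base URL. Below is a minimal sketch of what that looks like, using only the standard library; the `BASE_URL` and model id are placeholders, not Doubleword's actual values (with the official `openai` SDK, the equivalent change is just passing a different `base_url` to the client constructor).

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # swap this line; nothing else in the pipeline changes
MODEL = "vlm-placeholder"                # placeholder model id, not a real Doubleword model name

def build_annotation_request(image_url: str, question: str) -> dict:
    # Standard chat-completions payload with an image part: the same shape
    # works against any OpenAI-compatible endpoint.
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

def prepare_post(payload: dict, api_key: str) -> urllib.request.Request:
    # Builds (but does not send) the HTTP request, so the endpoint swap is
    # visible in one place.
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

The point of the sketch: the annotation logic and payload format stay identical; only the endpoint (and credentials) differ between providers.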
"The Doubleword team worked with us on batch annotation at scale. Their API made it economically viable to run two full annotation passes plus two cross-validation passes over 119K images with frontier reasoning models."
— Maziyar Panahi, Founder of OpenMed
Cost Breakdown
| Model | Provider | Total Cost | vs Doubleword |
|---|---|---|---|
| Qwen3.5-397B + Kimi-K2.5 | Doubleword | $452.58 | — |
| Qwen3.5-397B + Kimi-K2.5 | Alibaba Cloud + Moonshot AI | $1,393.39 | 3.1× more |
| Gemini 3 Flash | Google | $1,486.00 | 3.3× more |
| GPT-5 | OpenAI | $4,909.00 | 10.8× more |
| Claude Sonnet 4.6 | Anthropic | $7,487.00 | 16.5× more |
The Result
110,741 validated medical VQA records, open-sourced in full: datasets, model adapters, and code. Fine-tuning three small model families (2–3B parameters) on the synthetic dataset improved benchmarks across every model and every task. Best result: +15.0% average exact match improvement on Qwen3.5-2B.
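The cross-validation step described above can be sketched as a simple agreement filter between two annotation passes: keep only the records where both models produce the same answer, and report the agreement rate. The whitespace-and-case normalisation rule here is an illustrative assumption, not OpenMed's actual matching criterion.

```python
def normalise(answer: str) -> str:
    # Assumed normalisation: lowercase and collapse whitespace before comparing.
    return " ".join(answer.lower().split())

def cross_validate(pass_a: list[str], pass_b: list[str]) -> tuple[list[int], float]:
    # Returns the indices where the two passes agree, plus the agreement rate.
    agree = [i for i, (a, b) in enumerate(zip(pass_a, pass_b))
             if normalise(a) == normalise(b)]
    return agree, len(agree) / len(pass_a)
```

For example, `cross_validate(["Pleural effusion", "no finding"], ["pleural  effusion", "Cardiomegaly"])` keeps index 0 and reports 50% agreement. At dataset scale, the same filter is what turns 119K raw annotations into the 110K validated records.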
Why It Matters
Synthetic data generation and large-scale annotation are among the most cost-sensitive workloads in AI. They are high-volume, non-time-sensitive, and directly constrained by inference budget. At real-time API prices, only well-funded labs can annotate at the scale needed to produce useful training data. Async inference removes that constraint — and makes the full pipeline reproducible by anyone.