Doubleword
    Use Case: Synthetic Data Generation

    Generate 10,000 High-Fidelity Training Samples for $3.21

    The Challenge: The Data Bottleneck

    Fine-tuning models works, but acquiring the data is the ultimate bottleneck.

    Human Annotation: Costs $1–$5 per sample and takes weeks.
    Real-time APIs: Fast, but generating 10,000 samples burns hundreds of dollars in minutes.
    Standard Async: Cheap, but impossibly slow. A standard 3-stage pipeline requires waiting 24 hours between each step, taking three full days.

    The Doubleword Unlock

    Doubleword provides a high-throughput async inference engine built for multi-stage pipelines.

    The Result: Complete complex, multi-pass data generation pipelines in a fraction of the time. Drop generation costs by 97%, allowing you to iterate on your datasets exactly like you iterate on hyperparameters.

    📊 Case Study

    The Economics of Synthetic Data

    Dataset Generation Workload: 10,000 synthetic customer support conversations for fine-tuning, featuring controlled difficulty levels and topic coverage.

    Pipeline Structure: 3-stage map-reduce (Scenarios → Conversations → Quality Filter)

    Samples Generated: 10,000
    Total Tokens: 19.5M
    Pass Rate: 84%
    High-Quality Samples: 8,420

    Provider        Total Cost
    Doubleword      $3.21
    Doubleword      $6.13
    OpenAI          $108.83
    Anthropic       $154.62

    The Result: At $3.21 for 10,000 samples, the cost of generating data drops below the cost of manually curating it. You can afford to generate massive datasets, throw away the bottom 20%, and still pay a fraction of real-time API rates.
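As a quick sanity check, the per-sample and per-token rates implied by the case-study numbers can be worked out directly (a back-of-envelope calculation from the figures above, not an official rate card):

```python
# Back-of-envelope check on the case-study economics.
total_cost_usd = 3.21      # Doubleword total for the full run
samples = 10_000           # raw samples generated
kept = 8_420               # samples surviving the 84% quality filter
tokens_millions = 19.5     # total tokens across all three stages

cost_per_raw_sample = total_cost_usd / samples
cost_per_kept_sample = total_cost_usd / kept
blended_rate_per_m_tokens = total_cost_usd / tokens_millions

print(f"${cost_per_raw_sample:.5f} per raw sample")       # ~$0.00032
print(f"${cost_per_kept_sample:.5f} per kept sample")     # ~$0.00038
print(f"${blended_rate_per_m_tokens:.3f} per 1M tokens")  # ~$0.165
```

Even after discarding 16% of the output in the quality filter, the kept-sample cost stays around three hundredths of a cent, compared with $1–$5 for human annotation.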

    Generate Abundantly, Curate Aggressively

    When inference costs drop by 97%, the approach to synthetic data changes. You no longer try to generate the "minimum viable dataset." Instead, you over-generate, run strict automated quality filters, and aggressively discard anything that isn't perfect.

    Our recommended architecture utilizes a 3-Stage Async Pipeline leveraging Structured Outputs:

    01

    Scenario Generation

    Enqueue a workload to create thousands of unique customer scenarios with strict JSON schemas enforcing controlled attributes (e.g., 40% easy, 35% medium, 25% hard across 15 distinct topics).

    02

    Conversation Generation

    Dispatch the generated scenarios back into the async queue. The model acts as both the customer and the support agent, generating a multi-turn dialogue formatted as a strict JSON array.

    03

    LLM-as-a-Judge (Quality Filtering)

    Run a final async pass using a heavier model (like Qwen 235B) to score the generated conversations for naturalness and helpfulness. Automatically discard any sample scoring below a 3.5/5.
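A minimal sketch of the data shapes flowing through stages 1 and 3, with the model calls left out. The names (`make_scenario_spec`, `filter_by_judge`) and placeholder topics are illustrative, not part of Doubleword's API; in a real pipeline the scenario dict would be enforced by a strict JSON schema on the model's structured output.

```python
import random

# Controlled attribute distribution from Stage 1, matching the
# example split above: 40% easy, 35% medium, 25% hard.
DIFFICULTY_WEIGHTS = {"easy": 0.40, "medium": 0.35, "hard": 0.25}
TOPICS = [f"topic_{i}" for i in range(15)]  # placeholder topic names
PASS_THRESHOLD = 3.5                        # Stage 3 cutoff (out of 5)

def make_scenario_spec(rng: random.Random) -> dict:
    """Build one scenario request with controlled attributes."""
    difficulty = rng.choices(
        list(DIFFICULTY_WEIGHTS),
        weights=list(DIFFICULTY_WEIGHTS.values()),
    )[0]
    return {"difficulty": difficulty, "topic": rng.choice(TOPICS)}

def filter_by_judge(samples: list[dict]) -> list[dict]:
    """Stage 3: keep only samples at or above the score threshold."""
    return [s for s in samples if s["judge_score"] >= PASS_THRESHOLD]

rng = random.Random(0)
specs = [make_scenario_spec(rng) for _ in range(10_000)]
print(specs[0])  # e.g. {'difficulty': 'medium', 'topic': 'topic_3'}
```

Generating the spec before prompting (rather than asking the model to pick its own difficulty and topic) is what keeps the distribution exact across 10,000 samples.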

    How the Async Pipeline Works

    Instead of locking up your application server for hours waiting for LLM responses, you orchestrate the pipeline entirely in the background:

    01

    Enqueue Pass 1

    Submit your prompt templates and schema definitions to Doubleword's batch API.

    02

    Decouple

    Your pipeline orchestrator (like Airflow or a custom script) pauses. No HTTP connections are held open.

    03

    Webhook Trigger

    Doubleword processes the 10,000 scenarios in our high-throughput queue and hits your webhook upon completion.

    04

    Auto-Trigger Pass 2 & 3

    Your system automatically ingests the data and immediately dispatches the next stage of the pipeline back to Doubleword.

    By using high-throughput async queues, a pipeline that would take 3 wall-clock days on standard async infrastructure completes in hours.

    Ready to build your own Synthetic Data Factory?

    Stop letting API costs dictate the size and quality of your fine-tuning datasets. Shift your heavy data pipelines to the background.