Doubleword
    Customer Stories

    Real teams. Real savings.

    See how teams use Doubleword to run workloads that would be prohibitively expensive at real-time API rates.

    Doubleword×Dataiku

    A Custom PII Detection Model Trained for $50.
    Data from Closed-Source Providers Would Have Cost 20× More.

    How Dataiku's 575 Lab used Doubleword to generate the synthetic training data behind Kiji Privacy Proxy — an open-source tool that protects enterprise data in generative AI workflows.

    Saved vs Closed-Source

    95%

    Synthetic PII training data generated for $50 total — covering 26 PII entity types across emails, phone numbers, credit cards, SSNs, IP addresses, and more. The same workload with closed-source models: ~$1,000.

    26

    PII entity types

    <100ms

    Proxy latency

    95%

    Cost reduction

    Apache 2.0

    Open source license

    The Challenge

    Every time an enterprise user sends a prompt to an external LLM, that prompt may contain customer names, email addresses, SSNs, medical records, or financial details that should never leave the organisation's environment. A 2026 Dataiku/Harris Poll study found that 85% of CIOs have seen AI projects delayed or blocked due to privacy gaps. Building a reliable PII detection model requires large volumes of labeled training data — and generating that data at real-time API rates would have made the project very expensive.

    The Solution

    By routing synthetic data generation through Doubleword's async inference API, Dataiku's 575 Lab produced the high quality training dataset for Kiji's DistilBERT PII detection model at 5% of what closed-source providers would have charged. The OpenAI-compatible API integrated with no friction. The resulting model detects 26 PII types locally, with all inference happening on-device — no external API calls during detection. Latency stays under 100ms.

    "With Doubleword's batch inference platform, you can create your own large synthetic data [for custom PII models]."

    — Kiji Privacy Proxy documentation

    Cost Breakdown

    WorkloadProviderTotal Costvs Doubleword
    Synthetic PII data generationDoubleword$50.00
    Equivalent workloadClosed-source providers~$1,000.0020× more

    The Result

    Kiji Privacy Proxy is now live on GitHub under the Apache 2.0 license. The model and its training dataset are fully open on HuggingFace, meaning any team can inspect, reproduce, and extend them. For teams with domain-specific PII patterns — pharmaceutical identifiers, jurisdiction-specific ID formats, internal reference numbers — the full training pipeline is reproducible using Doubleword for synthetic data generation, Label Studio for annotation, and Metaflow for orchestration.

    Why It Matters

    Training a reliable PII detection model requires volume. Generating synthetic examples of sensitive data at sufficient scale and diversity is exactly the kind of high-volume, non-time-sensitive workload where async inference changes the economics entirely. At real-time API prices, this dataset would have cost $1,000 or more. At Doubleword's async rates, it cost $50 — making the entire project viable to open-source and reproducible by any team that wants to build their own domain-specific privacy proxy.