Customer Stories

Real teams. Real savings.

See how teams use Doubleword to run workloads that would be prohibitively expensive at real-time API rates.

A Custom PII Detection Model Trained for $50.
Data from Closed-Source Providers Would Have Cost 20× More.

How Dataiku's 575 Lab used Doubleword to generate the synthetic training data behind Kiji Privacy Proxy — an open-source tool that protects enterprise data in generative AI workflows.

Saved vs Closed-Source

95%

Synthetic PII training data generated for $50 total — covering 26 PII entity types across emails, phone numbers, credit cards, SSNs, IP addresses, and more. The same workload with closed-source models: ~$1,000.

PII entity types

<100ms

Proxy latency

95%

Cost reduction

Apache 2.0

Open source license

The Challenge

Every time an enterprise user sends a prompt to an external LLM, that prompt may contain customer names, email addresses, SSNs, medical records, or financial details that should never leave the organisation's environment. A 2026 Dataiku/Harris Poll study found that 85% of CIOs have seen AI projects delayed or blocked due to privacy gaps. Building a reliable PII detection model requires large volumes of labeled training data — and generating that data at real-time API rates would have made the project very expensive.

The Solution

By routing synthetic data generation through Doubleword's async inference API, Dataiku's 575 Lab produced the high quality training dataset for Kiji's DistilBERT PII detection model at 5% of what closed-source providers would have charged. The OpenAI-compatible API integrated with no friction. The resulting model detects 26 PII types locally, with all inference happening on-device — no external API calls during detection. Latency stays under 100ms.

"With Doubleword's batch inference platform, you can create your own large synthetic data [for custom PII models]."
— Kiji Privacy Proxy documentation

Cost Breakdown

Workload	Provider	Total Cost	vs Doubleword
Synthetic PII data generation	Doubleword	$50.00	—
Equivalent workload	Closed-source providers	~$1,000.00	20× more

The Result

Kiji Privacy Proxy is now live on GitHub under the Apache 2.0 license. The model and its training dataset are fully open on HuggingFace, meaning any team can inspect, reproduce, and extend them. For teams with domain-specific PII patterns — pharmaceutical identifiers, jurisdiction-specific ID formats, internal reference numbers — the full training pipeline is reproducible using Doubleword for synthetic data generation, Label Studio for annotation, and Metaflow for orchestration.

Why It Matters

Training a reliable PII detection model requires volume. Generating synthetic examples of sensitive data at sufficient scale and diversity is exactly the kind of high-volume, non-time-sensitive workload where async inference changes the economics entirely. At real-time API prices, this dataset would have cost $1,000 or more. At Doubleword's async rates, it cost $50 — making the entire project viable to open-source and reproducible by any team that wants to build their own domain-specific privacy proxy.

Try Kiji Privacy Proxy on GitHub Start generating synthetic data