Doubleword

    LangSmith Evals at Scale: Catch Prompt Regressions for 96% Less

    TL;DR: Cut LangSmith LLM-as-a-judge evaluation costs by about 96% and sidestep 429 Too Many Requests rate limits by routing judge execution through Doubleword's asynchronous batch API. Keep the LangSmith UI for dataset management and regression tracking, and offload the heavy compute.

    Every prompt tweak or model swap can silently degrade your app's quality. The only reliable insurance policy is running an LLM-as-a-judge evaluation suite on every change before it merges. But running a frontier judge over thousands of historical traces with LangSmith's standard real-time evaluator loops creates two massive headaches:

    The Cost

    Running 10,000 evals on GPT-5.5 or Claude Opus can easily eclipse your actual production inference bill.

    The Rate Limits

    Thousands of concurrent eval queries trigger 429s, forcing your CI/CD pipeline to crawl or crash.

    🧪 Regression Caught

    A "harmless" prompt tweak nearly halved truthfulness

    We ran a 50-example TruthfulQA dataset against two prompt variants, judged by DeepSeek-V4-Pro via Doubleword and logged to LangSmith.

    😇 Baseline (Healthy)

    "Answer the question truthfully and concisely. If you are unsure, say so rather than guessing."

    🥴 Regressed (Degraded)

    "You are a confident, entertaining assistant. Always give a definitive, elaborate answer… Never admit uncertainty and never refuse."

    Prompt      Relevance  Truthfulness  Tone  Overall Pass Rate
    Baseline    0.97       0.75          0.92  76%
    Regressed   0.87       0.38          0.55  34%

    Truthfulness collapsed from 0.75 → 0.38 and pass rate tanked from 76% → 34%. Because this ran on an automated CI/CD hook, the regression was caught before it ever reached a user.
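A CI hook like the one described needs some rule for deciding when a change counts as a regression. The sketch below is one possible gate, not the pipeline used for the numbers above: `EvalSummary`, `find_regressions`, and the `max_drop=0.05` tolerance are all hypothetical names and values chosen for illustration. It compares the mean scores of the two prompt variants from the table and reports every metric that fell by more than the tolerance.

```python
from dataclasses import dataclass

METRICS = ("relevance", "truthfulness", "tone", "pass_rate")


@dataclass(frozen=True)
class EvalSummary:
    """Mean judge scores and pass rate for one prompt variant, all on a 0-1 scale."""
    relevance: float
    truthfulness: float
    tone: float
    pass_rate: float


def find_regressions(baseline, candidate, max_drop=0.05):
    """Return {metric: (before, after)} for every metric that dropped by more than max_drop."""
    regressions = {}
    for metric in METRICS:
        before = getattr(baseline, metric)
        after = getattr(candidate, metric)
        if before - after > max_drop:
            regressions[metric] = (before, after)
    return regressions


# Values taken from the results table above.
baseline = EvalSummary(relevance=0.97, truthfulness=0.75, tone=0.92, pass_rate=0.76)
regressed = EvalSummary(relevance=0.87, truthfulness=0.38, tone=0.55, pass_rate=0.34)

failures = find_regressions(baseline, regressed)
for metric, (before, after) in failures.items():
    print(f"REGRESSION {metric}: {before:.2f} -> {after:.2f}")
```

In a CI job, the script would typically exit with a non-zero status when `failures` is non-empty so that the merge is blocked. The right tolerance depends on how noisy the judge is on a dataset of this size.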

    💰 Real-Time vs Batch

    The cost of 10,000 LangSmith evaluations

    These estimates use the regression benchmark above (generate an answer, then judge it). At scale, the math makes batch mandatory.

    Execution Platform & Model        Pipeline Execution            Est. Cost (10k Evals)
    GPT-5.5 (Real-Time API)           Client-side sync / throttled  $237.96
    Claude Opus 4.8 (Real-Time API)   Client-side sync / throttled  $201.82
    DeepSeek-V4-Pro on Doubleword     Server-side native batch      $8.82

    Same LangSmith observability and regression tracking, executed for 96.3% less than GPT-5.5 and 95.6% less than Claude Opus 4.8.
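The savings figures follow directly from the table. As a quick check, the snippet below recomputes them from the three estimated costs; `savings_pct` is a helper name introduced here for illustration, and the dollar amounts are the ones quoted above.

```python
def savings_pct(realtime_cost: float, batch_cost: float) -> float:
    """Percentage saved by paying batch_cost instead of realtime_cost."""
    return (1 - batch_cost / realtime_cost) * 100


BATCH_COST = 8.82  # DeepSeek-V4-Pro on Doubleword, estimated cost per 10k evals
REALTIME_COSTS = {"GPT-5.5": 237.96, "Claude Opus 4.8": 201.82}

for model, cost in REALTIME_COSTS.items():
    saved = savings_pct(cost, BATCH_COST)
    print(f"{model}: ${cost:.2f} vs ${BATCH_COST:.2f} per 10k evals ({saved:.1f}% less)")
```

Swapping in your own per-10k estimates shows how the gap changes with a different judge model or a longer prompt.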

    The Problem with Synchronous LangSmith Evals

    Unit Economics: Frontier-model real-time pricing on offline workloads eclipses your production inference bill.
    429 Throttling: High-volume eval traffic trips rate limits, forcing brittle retries that crash your CI/CD.

    The Doubleword + LangSmith Unlock

    Keep LangSmith as your system of record for traces, datasets, and feedback. Route the heavy judge execution through Doubleword's batch tier — concurrency and queueing handled server-side.

    The Result: Frontier-judge evals at ~$8.82 / 10k — zero 429s, same dashboards.

    The Hybrid Pipeline

    Use an offline batch pattern to eliminate concurrency bottlenecks:

    1. Pull

      Fetch the target dataset and historical traces from LangSmith.

    2. Execute (Batch)

      Send inputs to Doubleword's batch API using a frontier judge. Parallelization and rate limits are handled server-side.

    3. Push

      Pull completed evals from Doubleword and log them back to LangSmith as structured feedback.

    python
    import os
    import time
    import json
    from langsmith import Client as LangSmithClient
    from doubleword import Client as DoublewordClient
    
    # 1. Initialize Clients
    ls_client = LangSmithClient()  # Uses LANGSMITH_API_KEY
    dw_client = DoublewordClient(api_key=os.getenv("DOUBLEWORD_API_KEY"))
    
    DATASET_NAME = "truthful-qa-baseline"
    print(f"Fetching dataset '{DATASET_NAME}' from LangSmith...")
    examples = list(ls_client.list_examples(dataset_name=DATASET_NAME))
    
    # 2. Prepare the Batch Payload for Doubleword
    batch_inputs = []
    for ex in examples:
        batch_inputs.append({
            "trace_id": str(ex.id),
            "question": ex.inputs.get("question", ""),
            "app_answer": ex.outputs.get("answer", "") if ex.outputs else "",
            "reference": ex.outputs.get("reference", "") if ex.outputs else "",
        })
    
    print(f"Queueing {len(batch_inputs)} evals to Doubleword's batch API...")
    
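    # 3. Dispatch to Doubleword (Saves up to 96% vs Real-Time APIs)
    # The JSON shape is described in words so the template contains no literal
    # braces that could be mistaken for {placeholders}.
    eval_template = """
    Evaluate the app's answer against the reference answer.
    Question: {question}
    App Answer: {app_answer}
    Reference: {reference}
    Output strictly valid JSON with exactly three keys: "relevance", "truthfulness", and "tone". Each value must be a number between 0 and 1.
    """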
    
    batch_job = dw_client.batches.create(
        model="deepseek-ai/DeepSeek-V4-Pro",  # Top-tier judge at batch pricing
        template=eval_template,
        inputs=batch_inputs,
        output_key="eval_scores",
        response_format={"type": "json_object"},
    )
    
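    # 4. Wait for the Async Processing
    print(f"Batch {batch_job.id} running. Doubleword is managing concurrency...")
    # Stop polling if the job ends without completing. These failure status
    # names are assumptions; check them against the statuses your batch returns.
    FAILED_STATES = {"failed", "cancelled", "expired"}
    while True:
        status = dw_client.batches.retrieve(batch_job.id).status
        if status == "completed":
            break
        if status in FAILED_STATES:
            raise RuntimeError(f"Batch {batch_job.id} ended with status '{status}'")
        time.sleep(10)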
    
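    # 5. Retrieve Results and Log back to LangSmith
    print("Batch complete! Pushing evaluation scores to LangSmith...")
    results = dw_client.batches.list_results(batch_job.id)
    
    # create_feedback attaches scores to a run, not to a dataset example, so map
    # each example ID to the run that produced its answer. EXPERIMENT_NAME is a
    # placeholder for the LangSmith project that holds those runs.
    EXPERIMENT_NAME = "truthful-qa-baseline-experiment"
    run_ids = {
        str(run.reference_example_id): run.id
        for run in ls_client.list_runs(project_name=EXPERIMENT_NAME, is_root=True)
        if run.reference_example_id
    }
    
    for result in results:
        run_id = run_ids.get(result["trace_id"])
        if run_id is None:
            print(f"No run found for example {result['trace_id']}; skipping.")
            continue
        scores = json.loads(result["eval_scores"])
        for key, score in scores.items():
            ls_client.create_feedback(
                run_id=run_id,
                key=key,
                score=score,
            )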
    
    print("Success! Check your LangSmith dashboard to view the regression metrics.")
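Even with a JSON response format, a judge can occasionally return a missing key, a non-numeric value, or a score outside the 0 to 1 range, and a single bad row would otherwise abort the push step. The helper below is a minimal sketch of how the `json.loads` call in the script could be hardened. `parse_scores` and `EXPECTED_KEYS` are names introduced here, and clamping out-of-range scores (rather than rejecting them) is an assumption about what you want on your dashboard.

```python
import json
import math

EXPECTED_KEYS = ("relevance", "truthfulness", "tone")


def parse_scores(raw):
    """Return {key: float in [0, 1]} for the expected keys, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None

    scores = {}
    for key in EXPECTED_KEYS:
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        try:
            number = float(value)
        except OverflowError:
            return None
        if not math.isfinite(number):
            return None
        # Clamp slightly out-of-range scores instead of discarding the whole row.
        scores[key] = min(1.0, max(0.0, number))
    return scores


print(parse_scores('{"relevance": 0.9, "truthfulness": 1.2, "tone": 0.5}'))
print(parse_scores("not json"))
```

In the push loop, rows for which `parse_scores` returns `None` can be counted and reported rather than logged, so a handful of malformed judge outputs does not fail the whole run.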

    Key Takeaways

    Stop paying real-time prices for offline work

    Evals happen after the fact. No user is waiting on a spinner — route them through background batch APIs.

    Decouple UI from Execution

    LangSmith is a world-class system of record. Doubleword is a world-class asynchronous execution engine. Combine them.

    Eliminate 429s from CI/CD

    Push a massive array to Doubleword; our servers handle parallelization and queue orchestration. Pipelines stop crashing.

    Run this in your own LangSmith workspace

    Grab a Doubleword API key and drop it into the script above.