Multi-hop Agent Chains Don't Have Receipts

May 20, 2026 ai-governance agentic-systems multi-agent compliance cryptography eu-ai-act langchain

Consider a common agentic pipeline: Claude handles task planning, Mistral handles execution, GPT-4o handles quality review before final output is delivered to the user.

Three models. Three calls. Three proofs — if you're lucky.

The problem isn't whether each hop is auditable in isolation. Modern observability tools handle individual LLM calls reasonably well. The problem is the chain between hops. When GPT-4o reviewed the output, was it reviewing the actual result Mistral produced, or something that changed in transit? When Mistral executed the plan, did it execute Claude's plan, or a modified version? The individual proofs don't answer these questions. You need composable proofs — where each hop's proof commits to the prior hop's result.

Most pipelines don't have them.

Where the Gap Opens

Take a concrete sequence:

Claude (planner, hop 1) → Mistral (executor, hop 2) → GPT-4o (reviewer, hop 3) → final output

Each model call generates observable data: request payload, response body, latency, token count. Your observability stack captures all of it. But two things are missing at each hop boundary:

1. Input provenance binding. When GPT-4o receives Mistral's output for review, nothing cryptographically binds the input to the prior model's actual response. The input could be Mistral's verbatim output. It could be a slightly modified version — by middleware, by a bug in the pipeline orchestration, by an adversarial injection. The GPT-4o call record shows what GPT-4o was given, not whether that input was authentic.

2. Chain continuity proof. The chain is only as strong as its weakest handoff. If hop 1 is proven but the hop 1→2 boundary is unattested, you have two separate islands of observability with an unverified gap between them. Under EU AI Act Article 9, you need to demonstrate that your system behaved as designed from input to output, not just that individual components functioned correctly.

Here's the gap visualized:

Claude proof  ←  [unverified boundary]  →  Mistral proof  ←  [unverified boundary]  →  GPT-4o proof
     ✓                    ?                      ✓                    ?                     ✓

The proof chain has holes at every handoff.
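One way to close a single boundary, sketched under simple assumptions: record a digest of the exact bytes the upstream model returned, record a digest of the exact bytes the downstream model received, and compare the two. The names `digest` and `boundary_intact` and the sample payloads are illustrative and not part of any provider SDK.

```python
import hashlib


def digest(payload: bytes) -> str:
    """SHA-256 hex digest of the exact bytes that crossed the boundary."""
    return hashlib.sha256(payload).hexdigest()


def boundary_intact(upstream_record: dict, downstream_record: dict) -> bool:
    """True when the downstream input is byte-identical to the upstream output."""
    return upstream_record["output_hash"] == downstream_record["input_hash"]


# Hypothetical payloads: what Mistral returned vs. what middleware forwarded.
mistral_response = b'{"status": "done", "rows": 42}'
forwarded_clean = b'{"status": "done", "rows": 42}'
forwarded_altered = b'{"status": "done", "rows": 43}'

# Each record is written at its own side of the boundary.
upstream = {"model": "mistral", "output_hash": digest(mistral_response)}
clean = {"model": "gpt-4o", "input_hash": digest(forwarded_clean)}
altered = {"model": "gpt-4o", "input_hash": digest(forwarded_altered)}

print(boundary_intact(upstream, clean))    # True
print(boundary_intact(upstream, altered))  # False
```

The comparison only means something if the two digests are recorded independently: one when the response arrives, one when the next request is built. Copying the upstream digest into the downstream record would make the check pass regardless of what was sent.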

Why This Matters More Than It Looks

A single-model call with an audit trail is defensible. A three-model pipeline with three disconnected audit trails is not.

Regulators don't ask "can you prove what Claude produced?" They ask "can you prove the final output traces back to authorized inputs, through an unbroken chain of custody?" That's a different question.

Two failure modes where disconnected proofs become a liability:

Incident attribution. A multi-model pipeline produces an output that causes a compliance incident — incorrect data used in a financial decision, a privacy leak, a biased recommendation. Your audit trail shows three model calls. It doesn't tell you whether the incident originated in Claude's planning step, Mistral's execution, or GPT-4o's review. Without chain continuity, root cause analysis degrades to best-guess reconstruction.

Retroactive audit. EU AI Act Article 9 requires documentation of "measures to examine, test and validate datasets" and monitoring of "changes in the level of performance." When your pipeline uses three models with different update schedules, and an audit occurs 60 days later, you need to prove which model versions ran at each hop. Scattered call logs across three different provider dashboards — each with its own retention policy — don't constitute a chain of custody.
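A small sketch of the version-pinning part, assuming each provider response carries a model identifier that may be more specific than the alias you requested. The response dicts and the `"model"` field name below are stand-ins for illustration; check each provider's response schema before relying on them.

```python
import time


def version_record(hop: int, requested_model: str, response: dict) -> dict:
    """Capture which model actually served a hop, alongside what was requested."""
    return {
        "hop": hop,
        "requested_model": requested_model,
        # Assumption: the response echoes a resolved model identifier under "model".
        "resolved_model": response.get("model", requested_model),
        "recorded_at": time.time(),
    }


# Hypothetical responses; the dated identifier is made up for the example.
responses = [
    ("planner-alias", {"model": "planner-2026-01-15", "content": "plan"}),
    ("executor-alias", {"content": "result"}),  # no identifier echoed
]

records = [
    version_record(i + 1, alias, resp) for i, (alias, resp) in enumerate(responses)
]
for rec in records:
    print(rec["hop"], rec["requested_model"], "->", rec["resolved_model"])
```

Storing the resolved identifier inside the hop's own record means the answer to "which version ran" does not depend on any provider dashboard's retention window.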

What Composable Proof Looks Like

The mechanism isn't novel. Certificate chains in TLS, Git commit graphs, append-only Merkle logs — all solve the same composability problem. Each node commits to its parent, so you can verify any segment or the full chain from a single root.

For multi-hop agent pipelines, the pattern translates directly:

import hashlib
import json
import time
from typing import Optional


def create_hop_proof(model: str, input_hash: str, output: dict, prior_proof_id: Optional[str] = None) -> dict:
    """
    Creates a proof for one hop that chains to the previous hop's proof.
    """
    timestamp = time.time()
    output_hash = hashlib.sha256(
        json.dumps(output, sort_keys=True).encode()
    ).hexdigest()

    proof = {
        "model": model,
        "timestamp": timestamp,
        "input_hash": input_hash,
        "output_hash": output_hash,
        "prior_proof_id": prior_proof_id,  # binds this hop to the previous one
    }

    # The proof ID commits to all fields, including the prior_proof_id
    proof_id = hashlib.sha256(
        json.dumps(proof, sort_keys=True).encode()
    ).hexdigest()
    proof["proof_id"] = proof_id

    return proof


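def verify_chain(proofs: list[dict]) -> bool:
    """
    Verifies that proofs form an unbroken chain: each proof_id matches a hash
    of the proof's own fields, each proof's prior_proof_id matches the previous
    proof's proof_id, and each input_hash matches the previous output_hash.
    """
    for i, current in enumerate(proofs):
        # Self-consistency: recompute the ID over every field except proof_id,
        # so an edited model, timestamp, or hash field is detected
        body = {k: v for k, v in current.items() if k != "proof_id"}
        expected_id = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if current.get("proof_id") != expected_id:
            return False

        if i == 0:
            # The first proof in the list has no predecessor to compare against
            continue

        previous = proofs[i - 1]

        # Prior proof linkage
        if current["prior_proof_id"] != previous["proof_id"]:
            return False

        # Input provenance: this hop's input must be previous hop's output
        if current["input_hash"] != previous["output_hash"]:
            return False

    return True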


# Usage in a three-model pipeline
def run_auditable_pipeline(task: str) -> tuple[dict, list[dict]]:
    chain = []

    # Hop 1: Claude plans
    task_hash = hashlib.sha256(task.encode()).hexdigest()
    claude_output = call_claude(task)
    proof_1 = create_hop_proof("claude-opus-4-6", task_hash, claude_output)
    chain.append(proof_1)

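    # Hop 2: Mistral executes
    # Hash what is actually handed to Mistral rather than copying
    # proof_1["output_hash"], so verify_chain's provenance check is not trivially true
    claude_output_hash = hashlib.sha256(
        json.dumps(claude_output, sort_keys=True).encode()
    ).hexdigest()
    mistral_output = call_mistral(claude_output)
    proof_2 = create_hop_proof("mistral-large", claude_output_hash, mistral_output, proof_1["proof_id"])
    chain.append(proof_2)

    # Hop 3: GPT-4o reviews, with the same rule: hash the payload actually passed in
    mistral_output_hash = hashlib.sha256(
        json.dumps(mistral_output, sort_keys=True).encode()
    ).hexdigest()
    final_output = call_gpt4o(mistral_output)
    proof_3 = create_hop_proof("gpt-4o", mistral_output_hash, final_output, proof_2["proof_id"])
    chain.append(proof_3)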

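    # Chain integrity check before returning. An explicit raise still runs
    # under `python -O`, which strips assert statements.
    if not verify_chain(chain):
        raise RuntimeError("Chain integrity verification failed")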

    return final_output, chain

The key property: prior_proof_id in each hop's proof commits to the previous hop's complete proof record. This means:

You can verify any individual hop in isolation
You can verify the full chain from task input to final output
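If a proof no longer points to its predecessor, or a hop's recorded input hash differs from the previous hop's recorded output hash, verify_chain returns False and the chain is invalid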
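Verifying a single hop in isolation can be sketched as follows, assuming proofs shaped like the ones create_hop_proof returns and assuming you retained the hop's output payload. `sha256_json` and `verify_hop` are hypothetical helper names introduced for this example.

```python
import hashlib
import json


def sha256_json(obj) -> str:
    """Hash an object using the same convention as create_hop_proof."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def verify_hop(proof: dict, output: dict) -> bool:
    """Check one proof against its own ID and against the retained output."""
    body = {k: v for k, v in proof.items() if k != "proof_id"}
    return (
        sha256_json(body) == proof["proof_id"]
        and sha256_json(output) == proof["output_hash"]
    )


# Build one proof by hand with the same fields the pipeline uses.
output = {"result": "ok"}
body = {
    "model": "mistral-large",
    "timestamp": 1700000000.0,
    "input_hash": "ab" * 32,  # placeholder digest for the example
    "output_hash": sha256_json(output),
    "prior_proof_id": None,
}
proof = {**body, "proof_id": sha256_json(body)}

print(verify_hop(proof, output))                           # True
print(verify_hop({**proof, "model": "other"}, output))     # False
print(verify_hop(proof, {"result": "changed"}))            # False
```

The second check fails because the proof ID no longer matches the edited record; the third fails because the retained payload no longer matches the recorded hash.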

The Boundary That Most Implementations Miss

The input_hash field in each hop's proof is load-bearing. It means the input to GPT-4o (Hop 3) must be exactly the output Mistral produced (Hop 2). Not similar. Not "the same content with whitespace normalized." The same bytes, same encoding.

This constraint surfaces a problem most pipeline engineers haven't thought about: serialization boundaries.

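Suppose Claude produces a structured plan as a Python dict, you serialize it into a string prompt for Mistral, Mistral returns text that you parse back into a dict, and you serialize that again for GPT-4o. The hash at each step depends on your serialization convention. json.dumps(output, sort_keys=True) is repeatable for a given object within one Python runtime, but separator whitespace, Unicode escaping, and float formatting differ between JSON libraries and languages. If the component that verifies the chain serializes differently from the component that built it, the hashes stop matching and the proof chain breaks even though nothing was tampered with.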

This is why independent verification infrastructure matters. You want a single, consistent proof-generation layer that handles serialization deterministically across all hops, rather than implementing it inside each model integration separately.
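A minimal sketch of a shared serialization helper, assuming all hops exchange JSON-compatible data. It pins key order, separators, and encoding explicitly; it is not a full implementation of a canonicalization standard such as RFC 8785, and `canonical_bytes` and `canonical_hash` are names made up for the example.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Serialize with every formatting choice pinned explicitly."""
    return json.dumps(
        obj,
        sort_keys=True,          # key order does not depend on insertion order
        separators=(",", ":"),   # no optional whitespace
        ensure_ascii=False,      # emit UTF-8 instead of \u escapes
        allow_nan=False,         # reject NaN/Infinity, which are not valid JSON
    ).encode("utf-8")


def canonical_hash(obj) -> str:
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


# Same content, different key insertion order.
plan_a = {"steps": ["fetch", "summarize"], "owner": "planner"}
plan_b = {"owner": "planner", "steps": ["fetch", "summarize"]}

# Same content as raw text, different whitespace.
raw_spaced = '{"owner": "planner"}'
raw_compact = '{"owner":"planner"}'

print(canonical_hash(plan_a) == canonical_hash(plan_b))  # True
print(
    hashlib.sha256(raw_spaced.encode()).hexdigest()
    == hashlib.sha256(raw_compact.encode()).hexdigest()
)  # False: hashing raw text is whitespace-sensitive
print(
    canonical_hash(json.loads(raw_spaced))
    == canonical_hash(json.loads(raw_compact))
)  # True: parse, then canonicalize, then hash
```

Routing every hop through one helper like this keeps the convention in a single place, so a change to it is a deliberate, versioned decision rather than a silent divergence between integrations.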

Where This Lands in the Compliance Picture

EU AI Act Article 9 ("Risk management system") and Article 17 ("Quality management system") both require documentation of how multi-component AI systems behave across their lifecycle. For a pipeline with three models, "the system behaves as documented" requires proving continuity across hops, not just individual component behavior.

The composable proof pattern gives you two things you don't have with disconnected logs:

A single verifiable chain ID — one reference that covers the entire pipeline execution, provable to any auditor without requiring access to three different provider dashboards
Tamper detection at hop boundaries — if any handoff was altered, verify_chain fails, which surfaces the incident rather than silently producing a garbage output with a clean audit trail
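One possible reading of "a single verifiable chain ID", offered as an assumption rather than a description of any product: because every proof commits to its predecessor, the final proof's ID transitively commits to every earlier hop, so it can serve as the reference for the whole run. `seal`, `build_chain`, and `chain_id` are hypothetical helpers.

```python
import hashlib
import json
from typing import Optional


def seal(record: dict, prior_proof_id: Optional[str]) -> dict:
    """Attach the parent's ID, then derive this proof's ID from all fields."""
    body = {**record, "prior_proof_id": prior_proof_id}
    proof_id = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return {**body, "proof_id": proof_id}


def build_chain(records: list[dict]) -> list[dict]:
    """Seal each record in order, linking it to the one before."""
    chain: list[dict] = []
    prior = None
    for record in records:
        proof = seal(record, prior)
        chain.append(proof)
        prior = proof["proof_id"]
    return chain


def chain_id(chain: list[dict]) -> str:
    """The last proof's ID commits to the entire chain behind it."""
    return chain[-1]["proof_id"]


hops = [
    {"model": "planner", "output_hash": "aa"},
    {"model": "executor", "output_hash": "bb"},
    {"model": "reviewer", "output_hash": "cc"},
]
original = chain_id(build_chain(hops))

# Change only the first hop's record and rebuild.
tampered_hops = [dict(h) for h in hops]
tampered_hops[0]["output_hash"] = "ff"
tampered = chain_id(build_chain(tampered_hops))

print(original == chain_id(build_chain(hops)))  # True
print(original == tampered)                     # False
```

A change at hop 1 alters every later proof ID, which is the same property that makes a Git commit hash identify an entire history.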

If you're building multi-model pipelines today and composable chain proofs aren't on your roadmap yet, you're accumulating compliance debt against an August 2026 deadline.

The individual proofs are necessary. The chain between them is what's missing.

ArkForge Trust Layer provides independent composable attestation for multi-model pipelines — chain proofs verifiable across any combination of models and providers.
