Cryptographic receipts for every tool invocation. Verify what your agent actually did -- not what it claims it did.
LLM agents fabricate tool calls, return plausible-looking results from tools they never invoked, and confidently report success on operations that never happened.
There is no standard mechanism to verify whether an agent actually executed a tool call, passed the parameters it claims, or received the response it reports. Until now.
ToolProof sits between your agent and real tools -- as an HTTP proxy, SDK patch, or OpenClaw hook. Zero changes to your agent code.
Every tool call generates a cryptographic receipt: tool name, parameters, response, timestamp, and hash. Stored locally in SQLite.
Compare what the agent reported against what actually happened. Each call is scored: VERIFIED, UNVERIFIED, or TAMPERED.
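Conceptually, a receipt binds the call's fields together with a hash, and verification replays that comparison. A minimal Python sketch -- the field names and scoring rules below are illustrative assumptions, not ToolProof's actual schema:

```python
import hashlib
import json
import time

def make_receipt(tool: str, params: dict, response: str) -> dict:
    """Illustrative receipt: a hash binds tool, params, response, and time."""
    body = {"tool": tool, "params": params, "response": response,
            "timestamp": time.time()}
    # Canonical JSON so identical calls always hash identically (assumed scheme).
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

def score(claimed: dict, receipt: dict | None) -> str:
    """Assumed scoring: no receipt means UNVERIFIED; a mismatch means TAMPERED."""
    if receipt is None:
        return "UNVERIFIED"  # agent claims a call that was never intercepted
    if claimed["response"] != receipt["response"]:
        return "TAMPERED"    # agent reported a different result than it received
    return "VERIFIED"
```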
Zero-config interception. Point your agent at the proxy; ToolProof handles the rest.
One-line patch for the OpenAI and Anthropic SDKs. No proxy needed (sketched below).
First-class hook and plugin support for OpenClaw agents.
Read Claude Code session logs directly. Audit past sessions retroactively.
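The exact patch entry point isn't documented in this section; as a hypothetical sketch of the SDK route, assuming a `toolproof.patch()` function (name assumed, not confirmed):

```python
# Hypothetical API sketch -- `toolproof.patch()` is an assumed entry point;
# consult the ToolProof docs for the real one.
import toolproof
from openai import OpenAI

toolproof.patch()   # assumed one-liner: wraps the SDK's tool-call paths

client = OpenAI()   # calls made through this client now emit receipts
```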
AEGIS-style policy engine. Block or flag tool calls before they execute (example sketch below).
Per-call USD cost attribution. Know exactly what each tool invocation costs.
Shareable trust reports. Send a link, not a log file.
GitHub Action ready. Fail builds when trust score drops below threshold.
Native support for Hermes agent framework and OpenClaw skill ecosystem.
Analyze sessions and generate structured feedback. Karpathy-inspired eval-driven optimization built in.
Two full penetration testing rounds. Input validation, parameterized queries, no leaked secrets.
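The policy engine's configuration surface isn't shown here; a hypothetical Python sketch of what an AEGIS-style pre-execution check could decide (tool names, rules, and return values are assumptions):

```python
# Hypothetical policy sketch -- every rule below is an assumed example.
BLOCKED_TOOLS = {"shell_exec"}
FLAGGED_PATH_PREFIXES = {"file_write": ("/etc", "~/.ssh")}

def check_policy(tool: str, params: dict) -> str:
    """Decide BLOCK, FLAG, or ALLOW before the call is forwarded."""
    if tool in BLOCKED_TOOLS:
        return "BLOCK"
    for prefix in FLAGGED_PATH_PREFIXES.get(tool, ()):
        if str(params.get("path", "")).startswith(prefix):
            return "FLAG"
    return "ALLOW"
```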
Inspired by Andrej Karpathy's eval philosophy: you don't improve what you don't measure. ToolProof closes the loop between agent execution and agent improvement. Every session generates structured analytics. Feed results back into your agent as optimization signal.
Execute your agent with ToolProof intercepting all tool calls. Receipts are generated automatically.
Every call, parameter, response, and cost is stored in SQLite. Full session history, queryable with any SQLite client (see the sketch after these steps).
Run `toolproof analyze` to compute trust scores, failure patterns, cost distribution, and per-tool reliability metrics.
Run `toolproof feedback --format hermes` to generate structured improvement signals your agent framework can ingest directly.
Your agent receives concrete data on what went wrong and adjusts. The loop repeats. Trust scores trend upward.
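Because everything lands in plain SQLite, sessions can be inspected directly. A sketch assuming a hypothetical `receipts` table with `tool` and `cost_usd` columns (the real schema may differ -- inspect the database file to confirm):

```python
import sqlite3

# Database path and table/column names below are assumptions.
con = sqlite3.connect("toolproof.db")
rows = con.execute(
    """
    SELECT tool, COUNT(*) AS calls, SUM(cost_usd) AS cost
    FROM receipts
    GROUP BY tool
    ORDER BY cost DESC
    """
).fetchall()
for tool, calls, cost in rows:
    print(f"{tool}: {calls} calls, ${cost:.4f}")
```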
```
$ toolproof wrap -- python agent.py
[toolproof] intercepting tool calls...
[toolproof] 71 tests | 18 modules | 17 commands
[toolproof] trust score: A (1.00)
```

```
$ toolproof proxy --target http://localhost:3000
[toolproof] proxy listening on :8080
[toolproof] forwarding to http://localhost:3000
```

```
$ toolproof import-claude
[toolproof] found 3 sessions
[toolproof] 247 tool calls imported
[toolproof] 231 verified, 12 unverified, 4 tampered
```

```
$ toolproof ci --min-trust 0.8
[toolproof] trust score: 0.94 -- PASS
```

```
$ toolproof analyze
[toolproof] 71 tests | 18 modules | 17 commands
[toolproof] trust trend: 0.82 -> 0.94 (+14.6%)
[toolproof] top failure: file_read (3 unverified)
```

```
$ toolproof feedback --format hermes
[toolproof] feedback generated: 4 improvement signals
[toolproof] output: feedback-2026-04-06.json
```
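The feedback file's schema isn't documented in this section; assuming it carries a list of per-tool signals (the field names below are guesses), ingesting it could be as simple as:

```python
import json

# `signals`, `tool`, and `recommendation` are assumed field names.
with open("feedback-2026-04-06.json") as f:
    feedback = json.load(f)

for signal in feedback.get("signals", []):
    print(signal.get("tool"), "->", signal.get("recommendation"))
```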
The trust score is the ratio of verified tool calls to total claimed calls, weighted by call criticality (a sketch of the computation follows the table). Every session gets a letter grade.
| Grade | Score | Meaning |
|---|---|---|
| A | 0.95 -- 1.00 | All or nearly all tool calls verified. High confidence. |
| B | 0.85 -- 0.94 | Most calls verified. Minor gaps, no tampering detected. |
| C | 0.70 -- 0.84 | Significant unverified calls. Review recommended. |
| D | 0.50 -- 0.69 | Majority of calls unverified or tampered. Do not trust output. |
| F | Below 0.50 | Agent is fabricating tool calls. Output is unreliable. |
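The exact weighting isn't given; a minimal sketch consistent with the description above, where each call carries an assumed `weight` field standing in for criticality:

```python
def trust_score(calls: list[dict]) -> float:
    """Weighted ratio of verified calls to total claimed calls."""
    total = sum(c["weight"] for c in calls)
    verified = sum(c["weight"] for c in calls if c["status"] == "VERIFIED")
    return verified / total if total else 1.0

def grade(score: float) -> str:
    """Map a score onto the letter grades in the table above."""
    for letter, floor in (("A", 0.95), ("B", 0.85), ("C", 0.70), ("D", 0.50)):
        if score >= floor:
            return letter
    return "F"
```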
Anthropic / Claude Code
OpenAI
Peter Steinberger / OpenClaw
Andrej Karpathy (eval philosophy)
LangChain
Microsoft AGT
Saudi AI Community