Cryptographic receipts for every tool invocation. Verify what your agent actually did -- not what it claims it did.
LLM agents fabricate tool calls, return plausible-looking results from tools they never invoked, and confidently report success on operations that never happened.
There is no standard mechanism to verify whether an agent actually executed a tool call, passed the parameters it claims, or received the response it reports. Until now.
ToolProof sits between your agent and real tools -- as an HTTP proxy, SDK patch, or OpenClaw hook. Zero changes to your agent code.
Every tool call generates a cryptographic receipt: tool name, parameters, response, timestamp, and hash. Stored locally in SQLite.
Compare what the agent reported against what actually happened. Each call is scored: VERIFIED, UNVERIFIED, or TAMPERED.
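Conceptually, a receipt binds the call's fields together with a hash, and verification replays that comparison. A minimal Python sketch -- the field names and scoring rules below are illustrative assumptions, not ToolProof's actual schema:

```python
import hashlib
import json
import time

def make_receipt(tool: str, params: dict, response: str) -> dict:
    """Illustrative receipt: a hash binds tool, params, response, and time."""
    body = {"tool": tool, "params": params, "response": response,
            "timestamp": time.time()}
    # Canonical JSON so identical calls always hash identically (assumed scheme).
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

def score(claimed: dict, receipt: dict | None) -> str:
    """Assumed scoring: no receipt means UNVERIFIED; a mismatch means TAMPERED."""
    if receipt is None:
        return "UNVERIFIED"  # agent claims a call that was never intercepted
    if claimed["response"] != receipt["response"]:
        return "TAMPERED"    # agent reported a different result than it received
    return "VERIFIED"
```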
Zero-config interception. Point your agent at the proxy; ToolProof handles the rest.
One-line patch for the OpenAI and Anthropic SDKs. No proxy needed (sketched below).
First-class hook and plugin support for OpenClaw agents.
Read Claude Code session logs directly. Audit past sessions retroactively.
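The exact patch entry point isn't documented in this section; as a hypothetical sketch of the SDK route, assuming a `toolproof.patch()` function (name assumed, not confirmed):

```python
# Hypothetical API sketch -- `toolproof.patch()` is an assumed entry point;
# consult the ToolProof docs for the real one.
import toolproof
from openai import OpenAI

toolproof.patch()   # assumed one-liner: wraps the SDK's tool-call paths

client = OpenAI()   # calls made through this client now emit receipts
```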
AEGIS-style policy engine. Block or flag tool calls before they execute (example sketch below).
Per-call USD cost attribution. Know exactly what each tool invocation costs.
Shareable trust reports. Send a link, not a log file.
GitHub Action ready. Fail builds when trust score drops below threshold.
Native support for Hermes agent framework and OpenClaw skill ecosystem.
Analyze sessions and generate structured feedback. Karpathy-inspired eval-driven optimization built in.
Two full penetration testing rounds. Input validation, parameterized queries, no leaked secrets.
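The policy engine's configuration surface isn't shown here; a hypothetical Python sketch of what an AEGIS-style pre-execution check could decide (tool names, rules, and return values are assumptions):

```python
# Hypothetical policy sketch -- every rule below is an assumed example.
BLOCKED_TOOLS = {"shell_exec"}
FLAGGED_PATH_PREFIXES = {"file_write": ("/etc", "~/.ssh")}

def check_policy(tool: str, params: dict) -> str:
    """Decide BLOCK, FLAG, or ALLOW before the call is forwarded."""
    if tool in BLOCKED_TOOLS:
        return "BLOCK"
    for prefix in FLAGGED_PATH_PREFIXES.get(tool, ()):
        if str(params.get("path", "")).startswith(prefix):
            return "FLAG"
    return "ALLOW"
```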
Inspired by Andrej Karpathy's eval philosophy: you don't improve what you don't measure. ToolProof closes the loop between agent execution and agent improvement. Every session generates structured analytics. Feed results back into your agent as optimization signal.
Execute your agent with ToolProof intercepting all tool calls. Receipts are generated automatically.
Every call, parameter, response, and cost is stored in SQLite. Full session history, queryable with any SQLite client (see the sketch after these steps).
Run `toolproof analyze` to compute trust scores, failure patterns, cost distribution, and per-tool reliability metrics.
Run `toolproof feedback --format hermes` to generate structured improvement signals your agent framework can ingest directly.
Your agent receives concrete data on what went wrong and adjusts. The loop repeats. Trust scores trend upward.
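Because everything lands in plain SQLite, sessions can be inspected directly. A sketch assuming a hypothetical `receipts` table with `tool` and `cost_usd` columns (the real schema may differ -- inspect the database file to confirm):

```python
import sqlite3

# Database path and table/column names below are assumptions.
con = sqlite3.connect("toolproof.db")
rows = con.execute(
    """
    SELECT tool, COUNT(*) AS calls, SUM(cost_usd) AS cost
    FROM receipts
    GROUP BY tool
    ORDER BY cost DESC
    """
).fetchall()
for tool, calls, cost in rows:
    print(f"{tool}: {calls} calls, ${cost:.4f}")
```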
```
$ toolproof wrap -- python agent.py
[toolproof] intercepting tool calls...
[toolproof] 71 tests | 18 modules | 17 commands
[toolproof] trust score: A (1.00)
```

```
$ toolproof proxy --target http://localhost:3000
[toolproof] proxy listening on :8080
[toolproof] forwarding to http://localhost:3000
```

```
$ toolproof import-claude
[toolproof] found 3 sessions
[toolproof] 247 tool calls imported
[toolproof] 231 verified, 12 unverified, 4 tampered
```

```
$ toolproof ci --min-trust 0.8
[toolproof] trust score: 0.94 -- PASS
```

```
$ toolproof analyze
[toolproof] 71 tests | 18 modules | 17 commands
[toolproof] trust trend: 0.82 -> 0.94 (+14.6%)
[toolproof] top failure: file_read (3 unverified)
```

```
$ toolproof feedback --format hermes
[toolproof] feedback generated: 4 improvement signals
[toolproof] output: feedback-2026-04-06.json
```
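The feedback file's schema isn't documented in this section; assuming it carries a list of per-tool signals (the field names below are guesses), ingesting it could be as simple as:

```python
import json

# `signals`, `tool`, and `recommendation` are assumed field names.
with open("feedback-2026-04-06.json") as f:
    feedback = json.load(f)

for signal in feedback.get("signals", []):
    print(signal.get("tool"), "->", signal.get("recommendation"))
```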
The trust score is the ratio of verified tool calls to total claimed calls, weighted by call criticality (a sketch of the computation follows the table). Every session gets a letter grade.
| Grade | Score | Meaning |
|---|---|---|
| A | 0.95 -- 1.00 | All or nearly all tool calls verified. High confidence. |
| B | 0.85 -- 0.94 | Most calls verified. Minor gaps, no tampering detected. |
| C | 0.70 -- 0.84 | Significant unverified calls. Review recommended. |
| D | 0.50 -- 0.69 | Majority of calls unverified or tampered. Do not trust output. |
| F | Below 0.50 | Agent is fabricating tool calls. Output is unreliable. |
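The exact weighting isn't given; a minimal sketch consistent with the description above, where each call carries an assumed `weight` field standing in for criticality:

```python
def trust_score(calls: list[dict]) -> float:
    """Weighted ratio of verified calls to total claimed calls."""
    total = sum(c["weight"] for c in calls)
    verified = sum(c["weight"] for c in calls if c["status"] == "VERIFIED")
    return verified / total if total else 1.0

def grade(score: float) -> str:
    """Map a score onto the letter grades in the table above."""
    for letter, floor in (("A", 0.95), ("B", 0.85), ("C", 0.70), ("D", 0.50)):
        if score >= floor:
            return letter
    return "F"
```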
Anthropic / Claude Code
OpenAI
Peter Steinberger / OpenClaw
Andrej Karpathy (eval philosophy)
LangChain
Microsoft AGT
Saudi AI Community