# ChainReason
A small benchmark for evaluating LLM reasoning on Ethereum and DeFi tasks.
ChainReason is a lightweight evaluation suite that asks language models to do five things a smart-contract engineer or DeFi analyst would consider routine:
- `protocol_qa`: multiple-choice questions about specific DeFi protocol mechanics.
- `vuln_detect`: classify a Solidity snippet by vulnerability category.
- `contract_class`: classify a contract from its ABI summary plus an optional hint.
- `tx_intent`: given a sequence of decoded actions, infer the transaction's intent.
- `slippage_pred`: given an AMM pool state and a swap, compute the output amount.
The point of having five tasks instead of one is that each stresses a different
capability: symbolic reasoning, code understanding, structural pattern recognition,
and numeric reasoning. A model that's strong on `vuln_detect` but weak on
`slippage_pred` tells you something different from a model that's strong on both.
## Why another benchmark
Existing benchmarks for Solidity / blockchain LLMs largely focus on either (a) code generation, or (b) vulnerability detection. ChainReason adds three other axes that I haven't seen consolidated elsewhere:
- **Protocol-level reasoning.** Knowing what `getReserves()` returns is one thing; knowing what happens when you yank 30% of the reserves out of a Uniswap v2 pair is another.
- **Transaction-graph understanding.** Telling a sandwich apart from a swap or an arbitrage requires looking at the structure of an execution trace, not just opcodes.
- **Numeric grounding.** AMMs have closed-form pricing. If a model gets the CPMM math wrong, it'll be wrong about every downstream task.
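That closed-form pricing fits in a few lines. As a reference point, here is a minimal sketch of the Uniswap v2-style constant-product output calculation — the kind of math `slippage_pred` asks a model to reproduce. This helper is illustrative only, not part of ChainReason:

```python
# Illustrative only (not part of ChainReason): Uniswap v2-style
# constant-product swap output with the 0.3% fee, i.e.
# amount_out = (in * 997 * reserve_out) / (reserve_in * 1000 + in * 997)
def get_amount_out(amount_in: int, reserve_in: int, reserve_out: int) -> int:
    """Output amount for a swap against a CPMM pool charging a 0.3% fee."""
    amount_in_with_fee = amount_in * 997
    numerator = amount_in_with_fee * reserve_out
    denominator = reserve_in * 1000 + amount_in_with_fee
    return numerator // denominator

# Swap 1 ETH into a 100 ETH / 200,000 USDC pool (18 / 6 decimals):
out = get_amount_out(10**18, 100 * 10**18, 200_000 * 10**6)
# out lands a bit under the no-fee quote of ~1980.2 USDC, because of the fee
```

A model that reports the spot price (2000 USDC) instead of the fee- and slippage-adjusted output fails this task.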
The dataset is small and hand-curated — this is not a leaderboard scraper or a
ten-thousand-row crawl of Etherscan. The included seed examples are meant to be
illustrative; you can extend them with your own data via `--data-path`.
## Installation

```shell
git clone https://github.com/joshawome/chainreason.git
cd chainreason
pip install -e .
```

For local model inference (HuggingFace), also install:

```shell
pip install torch transformers accelerate
```
## Quick start

```shell
export OPENAI_API_KEY=...
python scripts/run_eval.py --task protocol_qa --client openai --model gpt-4o-mini --limit 5
```

Or run a full sweep from a YAML config:

```shell
python scripts/run_eval.py --config configs/full_run.yaml
python scripts/aggregate_results.py results/full -o results/full/SUMMARY.md
```
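The schema of `configs/full_run.yaml` isn't shown in this excerpt; a sweep config for the CLI above would plausibly look something like the following sketch (the key names are assumptions, not the project's documented schema):

```yaml
# Hypothetical sweep config -- actual keys may differ from this sketch.
tasks: [protocol_qa, vuln_detect, contract_class, tx_intent, slippage_pred]
client: openai
model: gpt-4o-mini
limit: 50
output_dir: results/full
```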
## Programmatic use

```python
from chainreason.tasks import get_task
from chainreason.models.openai_client import OpenAIClient
from chainreason.runner import run_eval

task = get_task("vuln_detect")
model = OpenAIClient(model="gpt-4o-mini")
summary = run_eval(task, model, limit=10, output_dir="results/")
print(summary["metrics"])
```
## Tasks
| Task | n (seed) | Output type | Metric |
|------|----------|-------------|--------|
| protocol_qa | 14 | A/B/C/D | accuracy |
| vuln_detect | 12 | label (1 of 6)
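To make the `tx_intent` axis concrete: a sandwich is recognizable purely from trace structure, before looking at any amounts. A toy structural check might look like the sketch below — the field names (`type`, `sender`, `token_in`, `token_out`) are assumptions for illustration, not ChainReason's actual schema, and the benchmark itself only asks the model to label the intent:

```python
# Toy structural check, illustrative only. Field names are assumptions,
# not ChainReason's decoded-action schema.
def looks_like_sandwich(actions: list[dict]) -> bool:
    """True if one actor swaps A->B before and B->A after another actor's swap."""
    swaps = [a for a in actions if a["type"] == "swap"]
    if len(swaps) < 3:
        return False
    front, victim, back = swaps[0], swaps[1], swaps[-1]
    return (front["sender"] == back["sender"]
            and victim["sender"] != front["sender"]
            and front["token_in"] == back["token_out"]
            and front["token_out"] == back["token_in"])
```

The same trace with the first and last swaps made by different actors would read as two independent swaps — which is exactly the kind of structural distinction the task probes.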