# ChainReason
A small benchmark for evaluating LLM reasoning on Ethereum and DeFi tasks.
ChainReason is a lightweight evaluation suite that asks language models to do five things a smart-contract engineer or DeFi analyst would consider routine:
- `protocol_qa`: multiple-choice questions about specific DeFi protocol mechanics.
- `vuln_detect`: classify a Solidity snippet by vulnerability category.
- `contract_class`: classify a contract from its ABI summary plus an optional hint.
- `tx_intent`: given a sequence of decoded actions, infer the transaction's intent.
- `slippage_pred`: given an AMM pool state and a swap, compute the output amount.
The point of having five tasks instead of one is that each stresses a different
capability: symbolic reasoning, code understanding, structural pattern recognition,
and numeric reasoning. A model that's strong on `vuln_detect` but weak on
`slippage_pred` tells you something different from a model that's strong on both.
## Why another benchmark
Existing benchmarks for Solidity / blockchain LLMs largely focus on either (a) code generation, or (b) vulnerability detection. ChainReason adds three other axes that I haven't seen consolidated elsewhere:
- **Protocol-level reasoning.** Knowing what `getReserves()` returns is one thing; knowing what happens when you yank 30% of the reserves out of a Uniswap v2 pair is another.
- **Transaction-graph understanding.** Telling a sandwich apart from a swap or an arbitrage requires looking at the structure of an execution trace, not just opcodes.
- **Numeric grounding.** AMMs have closed-form pricing. If a model gets the CPMM math wrong, it'll be wrong about every downstream task.
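That closed-form pricing fits in a few lines. As a reference point, here is a minimal sketch of the Uniswap v2-style constant-product output calculation — the kind of math `slippage_pred` asks a model to reproduce. This helper is illustrative only, not part of ChainReason:

```python
# Illustrative only (not part of ChainReason): Uniswap v2-style
# constant-product swap output with the 0.3% fee, i.e.
# amount_out = (in * 997 * reserve_out) / (reserve_in * 1000 + in * 997)
def get_amount_out(amount_in: int, reserve_in: int, reserve_out: int) -> int:
    """Output amount for a swap against a CPMM pool charging a 0.3% fee."""
    amount_in_with_fee = amount_in * 997
    numerator = amount_in_with_fee * reserve_out
    denominator = reserve_in * 1000 + amount_in_with_fee
    return numerator // denominator

# Swap 1 ETH into a 100 ETH / 200,000 USDC pool (18 / 6 decimals):
out = get_amount_out(10**18, 100 * 10**18, 200_000 * 10**6)
# out lands a bit under the no-fee quote of ~1980.2 USDC, because of the fee
```

A model that reports the spot price (2000 USDC) instead of the fee- and slippage-adjusted output fails this task.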
The dataset is small and hand-curated — this is not a leaderboard scraper or a
ten-thousand-row crawl of Etherscan. The included seed examples are meant to be
illustrative; you can extend them with your own data via `--data-path`.
## Installation

```shell
git clone https://github.com/joshawome/chainreason.git
cd chainreason
pip install -e .
```

For local model inference (HuggingFace), also install:

```shell
pip install torch transformers accelerate
```
## Quick start

```shell
export OPENAI_API_KEY=...
python scripts/run_eval.py --task protocol_qa --client openai --model gpt-4o-mini --limit 5
```

Or run a full sweep from a YAML config:

```shell
python scripts/run_eval.py --config configs/full_run.yaml
python scripts/aggregate_results.py results/full -o results/full/SUMMARY.md
```
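The schema of `configs/full_run.yaml` isn't shown in this excerpt; a sweep config for the CLI above would plausibly look something like the following sketch (the key names are assumptions, not the project's documented schema):

```yaml
# Hypothetical sweep config -- actual keys may differ from this sketch.
tasks: [protocol_qa, vuln_detect, contract_class, tx_intent, slippage_pred]
client: openai
model: gpt-4o-mini
limit: 50
output_dir: results/full
```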
## Programmatic use

```python
from chainreason.tasks import get_task
from chainreason.models.openai_client import OpenAIClient
from chainreason.runner import run_eval

task = get_task("vuln_detect")
model = OpenAIClient(model="gpt-4o-mini")
summary = run_eval(task, model, limit=10, output_dir="results/")
print(summary["metrics"])
```
## Tasks
| Task | n (seed) | Output type | Metric |
|------|----------|-------------|--------|
| protocol_qa | 14 | A/B/C/D | accuracy |
| vuln_detect | 12 | label (1 of 6)
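To make the `tx_intent` axis concrete: a sandwich is recognizable purely from trace structure, before looking at any amounts. A toy structural check might look like the sketch below — the field names (`type`, `sender`, `token_in`, `token_out`) are assumptions for illustration, not ChainReason's actual schema, and the benchmark itself only asks the model to label the intent:

```python
# Toy structural check, illustrative only. Field names are assumptions,
# not ChainReason's decoded-action schema.
def looks_like_sandwich(actions: list[dict]) -> bool:
    """True if one actor swaps A->B before and B->A after another actor's swap."""
    swaps = [a for a in actions if a["type"] == "swap"]
    if len(swaps) < 3:
        return False
    front, victim, back = swaps[0], swaps[1], swaps[-1]
    return (front["sender"] == back["sender"]
            and victim["sender"] != front["sender"]
            and front["token_in"] == back["token_out"]
            and front["token_out"] == back["token_in"])
```

The same trace with the first and last swaps made by different actors would read as two independent swaps — which is exactly the kind of structural distinction the task probes.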