LLM Tool-Selection Benchmark

North-star: CPC (Cost Per Correct) = total run cost / tasks with correct retrieval strategy  ·  Priority: Quality → Latency → Dollar cost
Providers: OpenAI · Anthropic · Ollama (local, $0)

llm-token-harness — Hardened Run Insights (June 12, 2026)

Run: 23-task hardened dataset, 12 configs, single run. North-star: CPC = run cost / tasks solved. Caveat up front: n=23, one run, no variance. Treat config-vs-config gaps of 1–2 tasks as noise; treat the structural patterns (which axes break models, which distractor gets picked) as the real signal. The June-10 grid was a different 15-task dataset and is not directly comparable score-for-score — it's only useful for "did the spread widen."


1. Executive summary


2. The CPC story — the cost/quality frontier

The full ladder (sorted by CPC)

| Config | Score/23 | Total cost | CPC | Mean latency |
| --- | --- | --- | --- | --- |
| gpt-4o-mini | 13 | $0.00523 | $0.000402 | 1.83s |
| gpt-5.4-nano | 16 | $0.00762 | $0.000476 | 1.25s |
| gpt-5.4-mini | 17 | $0.02757 | $0.001621 | 1.56s |
| claude-haiku-4-5 | 15 | $0.06979 | $0.004653 | 3.45s |
| claude-sonnet-4-6 | 17 | $0.21261 | $0.012507 | 3.44s |
| gpt-5.5 | 18 | $0.22715 | $0.012619 | 4.16s |
| claude-opus-4-6 | 18 | $0.36116 | $0.020064 | 4.26s |
| claude-opus-4-8 | 16 | $0.41047 | $0.025654 | 2.99s |
| Fable 5 (low) | 19 | $0.73438 | $0.038652 | 4.56s |
| Fable 5 (medium) | 19 | $0.76228 | $0.040120 | 4.98s |
| Fable 5 (high) | 18 | $0.79508 | $0.044171 | 5.25s |
| gemma4:12b | 9 | $0.00 | $0.00 (local) | 28.5s |
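
The CPC column can be recomputed from the score and total-cost columns. The snippet below is a minimal sketch of that arithmetic for four of the rows; the helper name `cost_per_correct` is illustrative and is not taken from the harness.

```python
# (score out of 23, total run cost in USD), copied from the ladder above.
RUNS = {
    "gpt-4o-mini": (13, 0.00523),
    "gpt-5.4-nano": (16, 0.00762),
    "claude-opus-4-8": (16, 0.41047),
    "Fable 5 (low)": (19, 0.73438),
}


def cost_per_correct(total_cost: float, n_correct: int) -> float:
    """CPC = total run cost / number of correctly solved tasks."""
    if n_correct <= 0:
        raise ValueError("CPC is undefined when no task was solved")
    return total_cost / n_correct


cpc = {name: cost_per_correct(cost, score) for name, (score, cost) in RUNS.items()}

# Print cheapest-per-correct first, matching the ladder's sort order.
for name, value in sorted(cpc.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} ${value:.6f}")
```

Dividing by solved tasks rather than attempted tasks is what makes a cheap config with a low score look worse than its per-token price suggests.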

The Pareto frontier (maximize score, minimize CPC)

Ignoring gemma (free, but more than 5× slower than any other config and lowest in accuracy, so a separate "local/zero-marginal-cost" regime), the non-dominated configs are:

  1. gpt-4o-mini: cheapest ($0.000402 CPC), anchors the low end at 13/23.
  2. gpt-5.4-nano: $0.000476 CPC, 16/23. Only ~18% more CPC than 4o-mini for +3 tasks.
  3. gpt-5.4-mini: $0.001621 CPC, 17/23. +1 task over nano at ~3.4× CPC.
  4. gpt-5.5: $0.012619 CPC, 18/23. +1 task over mini at ~7.8× CPC; no config reaches 18 for less.
  5. Fable 5 (low): $0.038652 CPC, 19/23 (the top score). The only way to buy the 19th correct task, at ~3.1× gpt-5.5's CPC.

Everything else is dominated: some frontier config matches or beats its score at lower CPC.
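
The dominance rule can be checked mechanically. The sketch below applies "at least as good on both axes, strictly better on one" to the ladder rows, with gemma4:12b excluded as above; `Config`, `dominates`, and `pareto_frontier` are illustrative names, not harness code.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    score: int   # tasks solved out of 23
    cpc: float   # USD per correct task


# Ladder rows from above; gemma4:12b is left out as a separate $0 regime.
LADDER = [
    Config("gpt-4o-mini", 13, 0.000402),
    Config("gpt-5.4-nano", 16, 0.000476),
    Config("gpt-5.4-mini", 17, 0.001621),
    Config("claude-haiku-4-5", 15, 0.004653),
    Config("claude-sonnet-4-6", 17, 0.012507),
    Config("gpt-5.5", 18, 0.012619),
    Config("claude-opus-4-6", 18, 0.020064),
    Config("claude-opus-4-8", 16, 0.025654),
    Config("Fable 5 (low)", 19, 0.038652),
    Config("Fable 5 (medium)", 19, 0.040120),
    Config("Fable 5 (high)", 18, 0.044171),
]


def dominates(a: Config, b: Config) -> bool:
    """a dominates b: no worse on score and CPC, strictly better on one."""
    return (a.score >= b.score and a.cpc <= b.cpc
            and (a.score > b.score or a.cpc < b.cpc))


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep every config that no other config dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


frontier = pareto_frontier(LADDER)
for c in frontier:
    print(f"{c.name:<18} {c.score}/23  ${c.cpc:.6f}")
```

With n=23 and a single run, a one-task difference can move a config on or off the frontier, so the output is best read as a screening step rather than a ranking.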

"$/solved-task vs $/token," made concrete

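The 18–19/23 tier is where the debate lives. Five configs land there; the table shows four, since Fable 5 (medium) ties low at 19/23 with a higher CPC ($0.040120):

| Config | Score | CPC ($/solved) | Output tok |
| --- | --- | --- | --- |
| gpt-5.5 | 18 | $0.012619 | 2542 |
| claude-opus-4-6 | 18 | $0.020064 | 4062 |
| Fable 5 (high) | 18 | $0.044171 | 3800 |
| Fable 5 (low) | 19 | $0.038652 | 2586 |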

A buyer optimizing $/token would rank by output volume and might pick gpt-5.5 or pay up for a flagship's tokens. A buyer optimizing CPC sees that gpt-5.5 delivers 18 correct tasks at roughly 3.1× less per correct task than Fable-low's 19 ($0.012619 vs $0.038652), and that Fable's one extra correct task costs about $0.51 in additional run spend ($0.73438 vs $0.22715). CPC turns "which is cheaper per token" into the decision that actually matters: how many dollars to get the job done right. That is the argument the blog is making, and the nano-vs-Opus-4.8 pair in the ladder (both 16/23, roughly 54× apart on CPC) is the cleanest single proof of it.
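
The step from gpt-5.5 to Fable 5 (low) can be expressed as a marginal cost per extra solved task. This is a small sketch of that arithmetic using the ladder's total-cost figures; the function names are illustrative.

```python
# (score out of 23, total run cost in USD), from the ladder.
gpt_55 = (18, 0.22715)
fable_low = (19, 0.73438)


def cpc(run: tuple[int, float]) -> float:
    """Cost per correct task for one run."""
    score, cost = run
    return cost / score


def marginal_cost_per_task(cheaper: tuple[int, float],
                           pricier: tuple[int, float]) -> float:
    """Extra run cost divided by extra tasks solved when stepping up a tier."""
    extra_tasks = pricier[0] - cheaper[0]
    if extra_tasks <= 0:
        raise ValueError("pricier config must solve more tasks")
    return (pricier[1] - cheaper[1]) / extra_tasks


step_up = marginal_cost_per_task(gpt_55, fable_low)
ratio = cpc(fable_low) / cpc(gpt_55)
print(f"marginal cost of the extra task: ${step_up:.5f}")
print(f"CPC ratio, Fable 5 (low) / gpt-5.5: {ratio:.2f}x")
```

Average CPC and marginal cost answer different questions: the first compares whole configs, the second prices the last task bought by upgrading.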


3. What discriminates models now — ranking the axes

Using task_difficulty (pass count across 12 configs), the hardest tasks cluster by axis. Lower pass count = sharper discriminator.

| Task | n_pass/12 | Axis |
| --- | --- | --- |
| halverson_dispute_02 | 1 | Parallel invocation |
| vendor_renewal_decompose_01 | 2 | Query decompose |
| vendor_autorenew_02 | 3 | Long chain / constraint filter |
| easton_amendment_02 | 4 (alt) | Chain, non-adjacent state |
| halverson_dispute_01 | 4 | Long-chain entry |
| halverson_dispute_05 | 5 (alt) | Long chain |
| vendor_autorenew_04 | 5 | Chain / filter |
| (baseline floor + easy chains) | 8–12 | |

Ranked by spread created:

  1. Parallel invocation (axis 4): by far the sharpest. halverson_dispute_02: 1/12 pass. The failure mode is uniform and diagnostic: 8 configs collapsed the task into two list_documents calls (Fable all 3 efforts, opus-4-8, gpt-5.4-mini, sonnet, gpt-5.5, opus-4-6 partially). Only gpt-5.4-nano fanned out into 4 distinct search calls and matched both required specs. gemma emitted no calls. This isn't a near-miss; it's a clean capability line.
  2. Query decompose: second sharpest, and possibly a grading artifact (see §6). vendor_renewal_decompose_01: 2/12 pass, and only Sonnet 4.6 and Opus 4.6 got it. Note that nano, gpt-5.5, all Fable efforts, and opus-4-8 failed it. When the only two passers are an adjacent-generation Anthropic pair and the newer, nominally stronger models miss it, suspect the expected answer is narrowly specified rather than the task being "hard." Flag for adjudication.
  3. Long chains with non-adjacent state (axis 1). The halverson_dispute 5-step chain is the graveyard: gpt-4o-mini and gemma went 0/5; the best (nano, gpt-5.4-mini, gpt-5.5, opus-4-6, all Fable) max out at 3/5. No config solved the full Halverson chain. vendor_autorenew (4 steps) similarly tops out: only Fable low/med hit 4/4. Chains spread the field but more gradually than parallel.
  4. Constraint-dense filters (axis 2). Hardest to isolate because filters are embedded in chain steps (vendor_autorenew_02/04, 3 and 5 passes). They contribute to the chain attrition above rather than standing alone. Moderate discriminator.

The baseline floor held: 7 of the easy floor/early-chain tasks scored 12/12 (corvid_fetch_01, nda_inventory_01, q1_2025_inventory_01, nonsolicit_search_01, termination_topk_01, easton_amendment_01, easton_amendment_03). Single-turn selection from a small tool set remains saturated, exactly the prior finding, now confirmed as the floor while the hard axes do the work.


4. Behavioral findings

Distractor susceptibility — it's all summarize_document

Decompose vs. parallel — a behavioral split

The data hints at two different "break complex query into pieces" instincts:

Effort-knob economics (Fable 5)

| Effort | Score | Output tok | Total cost | CPC |
| --- | --- | --- | --- | --- |
| low | 19 | 2586 | $0.73438 | $0.038652 |
| medium | 19 | 3144 | $0.76228 | $0.040120 |
| high | 18 | 3800 | $0.79508 | $0.044171 |

The knob now scales cost monotonically (output tokens +47% low→high) but not quality. low=medium on score; high regressed one task (lost vendor_autorenew_04, kept everything else). On June-10 the knob was a pure no-op (low=med=high=10). So it went from "does nothing" to "spends more for the same-or-worse result." On this dataset, Fable 5 (low) strictly dominates medium and high — there is no reason to turn the knob up. Worth flagging to whoever owns Fable's pricing/effort story.
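
To make the knob's effect concrete, the sketch below expresses each tier as a change relative to low, using only the three table rows. The dictionary layout and the `pct_change` helper are illustrative.

```python
# (score, output tokens, total cost in USD) per effort tier, from the table above.
TIERS = {
    "low": (19, 2586, 0.73438),
    "medium": (19, 3144, 0.76228),
    "high": (18, 3800, 0.79508),
}


def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


base_score, base_tok, base_cost = TIERS["low"]
deltas = {
    tier: {
        "score": score - base_score,
        "tokens_pct": pct_change(tok, base_tok),
        "cost_pct": pct_change(cost, base_cost),
    }
    for tier, (score, tok, cost) in TIERS.items()
}

for tier, d in deltas.items():
    print(f"{tier:<6} score {d['score']:+d}  "
          f"tokens {d['tokens_pct']:+.1f}%  cost {d['cost_pct']:+.1f}%")
```

Reporting score, token, and cost deltas side by side keeps the "spends more for the same-or-worse result" claim tied to the numbers it comes from.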


5. Surprises / counterintuitive results


6. Caveats & threats to validity


7. Blog-angle suggestions

  1. "Cost per correct, not cost per token: how a $0.0005 nano beat a $0.026 flagship." Lead with the gpt-5.4-nano vs Opus-4.8 row: same score, 54× CPC gap. It's the single cleanest number in the dataset and it is the $/solved-task argument. Honest: same score, so it's a pure efficiency win, not a quality claim.
  2. "The knob that spends but doesn't deliver: Fable 5's effort dial in numbers." low/med/high = 19/19/18 on rising cost; high regressed. A concrete, slightly contrarian take on reasoning-effort controls: buy the cheapest effort tier until proven otherwise. (Caveat the reasoning-token gap so it's not overclaimed.)
  3. "What actually separates frontier models now: parallel tool-calling." 1/12 on the parallel task; the uniform "collapse to two list_documents" failure is a vivid, reproducible story about a specific missing competence, and nano being the lone solver subverts the "bigger model wins" expectation. The strongest capability (not just cost) narrative in the run.

A fourth, softer angle if wanted: "Two ways to plan, and no model does both" — the decompose-vs-parallel split — but only run it if §6's grading-artifact suspicion on the decompose task clears adjudication.

Dataset Hardening Roadmap — Round 2

Grounded in the 2026-06-12 run (23 tasks × 12 configs). Source: /tmp/llm-harness-viz/bench_data.json, data/tasks/search_agent_v1.json, src/scorer.py, src/tools.py.

Scoreboard for reference (score / 23): Fable 5 low+med 19, gpt-5.5 / opus-4-6 / gpt-5.4-mini / sonnet-4-6 / Fable-high 17–18, gpt-5.4-nano / opus-4-8 16, haiku-4-5 15, gpt-4o-mini 13, gemma4:12b 9. Spread 9→19 (was 10→14 on the June-10 15-task grid). Hardening worked; now we trim dead weight and add new discriminating axes.


1. Diagnosis from this run

Saturated — passed by all 12 (n_pass == 12): dead weight as discriminators

Near-saturated (11/12, single straggler = gemma or nano): vendor_autorenew_01, vendor_autorenew_03, vendor_indemnification_2024_01, okafor_leases_01.

Recommendation.

Discriminating well (mid-band n_pass 4–9, splits the frontier models)

These are the gold. The mid-band is where CPC means something. Note stonebridge_term_01 is tagged is_floor but behaves like a mid-band discriminator (8/12) — the fetch-vs-search trap is harder than its "floor" label suggests. It also drew summarize_document picks from 4o-mini, sonnet, gemma. Keep it; arguably re-label it as discriminating, not floor.

At the floor — passed by ≤2 (too hard, or mis-graded?)

Bottom line: 7 saturated (keep 5 floor, retire 2 chain-padding), ~10 discriminating, 2 at floor (1 grading artifact in parallel, 1 likely mis-grade in decompose). Net: the dataset is healthier than the raw spread suggests, but vendor_renewal_decompose_01 should be adjudicated before the next run.
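
The bands used in this diagnosis can be written down as a small classifier. This is a sketch under one stated assumption: pass counts of 3 and 10 are not assigned to any band by the headings above, so they are reported as "unbanded" here. The function name `band` is illustrative.

```python
N_CONFIGS = 12


def band(n_pass: int) -> str:
    """Map a task's pass count (out of 12 configs) to a diagnosis band."""
    if not 0 <= n_pass <= N_CONFIGS:
        raise ValueError(f"n_pass must be in 0..{N_CONFIGS}")
    if n_pass == N_CONFIGS:
        return "saturated"
    if n_pass == N_CONFIGS - 1:
        return "near-saturated"
    if 4 <= n_pass <= 9:
        return "discriminating"
    if n_pass <= 2:
        return "at-floor"
    return "unbanded"  # 3 or 10: assumption, the section leaves these unassigned


# Pass counts quoted elsewhere in this report.
n_pass = {
    "halverson_dispute_02": 1,
    "vendor_renewal_decompose_01": 2,
    "stonebridge_term_01": 8,
    "okafor_leases_01": 11,
    "corvid_fetch_01": 12,
}
bands = {task: band(n) for task, n in n_pass.items()}

for task, label in bands.items():
    print(f"{task:<30} {label}")
```

Running the same classifier after each grid makes it easy to see tasks drifting from "discriminating" toward "saturated" as models improve.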


2. Where the current axes have headroom

Parallel invocation (halverson_dispute_02): 1/12 — discriminator or artifact?

Both. Look at the actual call sequences in parallel_task.per_config:

Verdict: the parallel axis is a good discriminator but the single task is over-tuned on keyword choice. The keyword "indemnification" is too narrow — the decompose output (step 2's tool response) says "indemnification and limitation of liability", so a model that queries ["halverson","liability"] is equally correct and scores 0. Fix: loosen step-02 expected keywords to the counterparty only (["halverson"] / ["apex"]), or add expected_alternatives-style per-spec tolerance. (Note: expected_parallel does not support alternatives today — see §4 code note.) A second, cleaner parallel task (Axis A) will tell us whether 1/12 is the task or the axis.
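
The over-tuning can be illustrated with a toy version of injective spec-to-call matching. This is not the code in src/scorer.py: it is a sketch that assumes a keyword spec matches when every expected keyword appears among the call's query keywords, and the "strict" specs are an illustrative reconstruction of the current task rather than a copy of it.

```python
def keywords_match(expected: list[str], actual: list[str]) -> bool:
    """Assumed rule: every expected keyword is present in the call's keywords."""
    return {k.lower() for k in expected} <= {k.lower() for k in actual}


def score_parallel(expected_specs: list[list[str]],
                   actual_calls: list[list[str]]) -> tuple[int, list[int]]:
    """Match specs to calls one-to-one; return (n_matched, unmatched spec indices)."""
    match_of_call: dict[int, int] = {}  # call index -> spec index

    def try_assign(spec: int, seen: set[int]) -> bool:
        for call, keywords in enumerate(actual_calls):
            if call in seen or not keywords_match(expected_specs[spec], keywords):
                continue
            seen.add(call)
            # Take a free call, or bump its current owner if that owner can move.
            if call not in match_of_call or try_assign(match_of_call[call], seen):
                match_of_call[call] = spec
                return True
        return False

    for spec in range(len(expected_specs)):
        try_assign(spec, set())
    matched = set(match_of_call.values())
    failed = [s for s in range(len(expected_specs)) if s not in matched]
    return len(matched), failed


strict = [["halverson", "indemnification"], ["apex", "indemnification"]]
loose = [["halverson"], ["apex"]]
calls = [["halverson", "liability"], ["apex", "indemnification"]]

print(score_parallel(strict, calls))  # the "liability" phrasing misses one spec
print(score_parallel(loose, calls))   # counterparty-only specs match both calls
```

Under these assumptions the same pair of calls fails the strict specs and passes the loosened ones, which is the artifact the verdict describes.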

Chains: discriminating well

halverson_dispute (5 steps) is the best axis in the set — correct/total ranges from 0/5 (gpt-4o-mini, gemma) to 3/5 (frontier). Nobody clears it; the non-adjacent state dependency (step 4 needs doc_07 surfaced two turns back, step 5 needs a never-surfaced rider) is exactly the intended difficulty. vendor_autorenew also splits well (0/4 gemma → 4/4 Fable). easton is the weakest chain because half its steps are saturated (§1). Keep all three; tighten easton.

Near-miss distractors: half-working


3. Proposed NEW hardening axes

Five axes, prioritized in §5. Every example is authorable in the current schema unless a code note says otherwise.

Axis A — Longer parallel batches (3–4 distinct calls)

Capability: issuing 3+ genuinely independent retrievals in one turn without serializing or over-calling. Real search agents (LangChain RunnableParallel, LlamaIndex SubQuestionQueryEngine, Anthropic parallel tool-use) fan out N independent sub-queries in a single turn; 2-way is the easy case, 3–4-way is where models start serializing or dropping a leg.

Why it discriminates: the existing 2-way task already split the field; widening to 3 specs raises the injective-match bar and punishes "issued 2 of 3", which parallel_matched/parallel_failed_specs already records for free.

Difficulty/spread: expect 0–3 / 12 passing. Likely only nano + one or two frontier configs. Strong top-end discriminator.

Code changes: none for scoring (_score_parallel already does N-way bipartite matching). But see §4: expected_parallel can't carry per-spec alternatives, so keep keywords to the counterparty token plus one robust topic token to avoid the §2 over-tuning trap.

Example task (new standalone, not a chain — keeps it simple to author):

{
  "task_id": "tri_counterparty_parallel_01",
  "scenario_id": "tri_counterparty_parallel",
  "step": 1,
  "description": "Three independent counterparty lookups; correct move is three parallel searches in one batch. Issuing fewer, or serializing, scores 0.",
  "messages": [
    {"role": "system", "content": "<<standard contract-intelligence system prompt>>"},
    {"role": "user", "content": "Board prep: I need the termination-for-convenience clause from three separate agreements at once — the Halverson MSA, the Apex Components MSA, and the Corvid MSA. Pull them in parallel; don't make me wait on three round-trips."}
  ],
  "expected_parallel": [
    {"tool": "search", "args": {"query": {"type": "keywords", "value": ["halverson", "termination"]}}},
    {"tool": "search", "args": {"query": {"type": "keywords", "value": ["apex", "termination"]}}},
    {"tool": "search", "args": {"query": {"type": "keywords", "value": ["corvid", "termination"]}}}
  ],
  "scoring_weights": {}
}

Keywords are counterparty + one robust topic token only — avoids the indemnification-vs-liability artifact from §2.

Axis B — Content-dependent multi-hop (step N keyed on content returned at step N-2)

Capability: carrying a concrete value (a doc title, a section number, a counterparty name) out of a document's body text, not its metadata, and using it as the anchor for a later retrieval. vendor_autorenew_04 already does a one-hop version (Kestrel's body names the "Master Vendor Program Agreement"). This axis makes it a two-hop non-adjacent dependency: the value needed at step N was buried in the body of a doc fetched at step N-2, and nothing in steps N-1 or the user turn repeats it.

Why it's a real skill: agentic RAG (self-RAG, ReAct-style retrieval loops) routinely chases cross-references: "see Schedule C", "governed by Section 9.4 of the MSA", "as defined in the Master Agreement". The model must read, extract the pointer, and re-retrieve. This is the single most common failure mode in production contract agents.

Difficulty/spread: expect 3–6 / 12. This is the mid-band sweet spot: frontier models hop, weaker models re-search the wrong anchor or summarize.

Code changes: none. Standard single-call expected with keywords.

Example (replaces saturated easton_amendment_03, slots into the easton chain as a content-hop step). Suppose the amendment body (doc_20) references "the Pinehurst Master Lease Framework" governing all the landlord's leases:

{
  "task_id": "easton_amendment_03b",
  "scenario_id": "easton_amendment",
  "step": 3,
  "description": "The amendment body cross-references the 'Pinehurst Master Lease Framework' as governing the renewal mechanics. That framework has never been surfaced and its doc_id is unknown; the only anchor is the name appearing in doc_20's body two steps back. Correct move: search anchored on that framework name.",
  "messages": ["<<history through fetching doc_20, whose content names the 'Pinehurst Master Lease Framework'>>"],
  "expected": {
    "tool": "search",
    "args": {"query": {"type": "keywords", "value": ["master lease framework"]}}
  },
  "scoring_weights": {}
}

Distractor pull here: summarize_document(doc_20) (re-summarize instead of chasing the reference) and re-fetching doc_19/doc_20 (re-reading what you've already got).

Axis C — Cost-trap / over-calling tasks (a cheap single move suffices)

Capability: restraint. The correct answer is one cheap, targeted call; the trap is to fan out, decompose, or list-then-search when the user already pinned the answer down. CLAUDE.md's north star is CPC, and a model that gets the right answer via 4 calls when 1 was right inflates the cost numerator without adding a correct task. We currently reward correctness but never penalize over-calling, even though it's the core cost story.

Why it's a real skill: "minimal tool use" is an explicit objective in agent frameworks (OpenAI's tool-use guidance, Anthropic's "don't over-tool" guidance). Over-calling is the dominant cost driver in production.

Difficulty/spread: with all-or-nothing scoring as-is, a cost-trap only discriminates if the over-call produces a wrong tool (then it already fails). To make it bite under V1 scoring, design it so the tempting expansion is a distractor or a different tool. Expect 5–9 / 12.

Code changes: none for V1 (lean on tool-name mismatch). For V2, a real cost-trap wants a "calls ≤ N" graded signal (see §4). For now, actual_tools already records batch length, so over-call rate falls out as an unscored behavioral metric (like distractor-pick rate) even without changing the score.

Example (single call should win; decompose is the trap):

{
  "task_id": "single_clause_costtrap_01",
  "scenario_id": "single_clause_costtrap",
  "step": 1,
  "description": "Looks multi-part ('and') but is one narrow content lookup in one named doc family. The cheap move is a single filtered search. query_decompose is the over-engineering trap; it's not wrong-tool but it burns a turn — scored via expected=search, and decompose scores 0.",
  "messages": [
    {"role": "system", "content": "<<standard system prompt>>"},
    {"role": "user", "content": "Real quick — across our NDAs, what's the standard confidentiality survival period after termination? Just the survival clause language, nothing fancy."}
  ],
  "expected": {
    "tool": "search",
    "args": {
      "query": {"type": "keywords", "value": ["survival"]},
      "filters": {"type": "exact", "value": {"type": "nda"}}
    }
  },
  "expected_alternatives": [
    {"tool": "search", "args": {"query": {"type": "keywords", "value": ["confidentiality", "survive"]}, "filters": {"type": "exact", "value": {"type": "nda"}}}}
  ],
  "scoring_weights": {}
}

(The alternative uses the same expected arg shape as the primary.) Track batch length of the passing configs from actual_tools to report over-call behavior even though the score is binary.
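
One way to report over-calling without touching the score is sketched below. The `TaskResult` class here is a hypothetical stand-in for the harness's own type, `min_calls` is an assumed per-task field that the current schema does not define, and the sample rows are made up purely to exercise the function.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:            # hypothetical stand-in, not the harness class
    config: str
    task_id: str
    actual_tools: list[str]  # tool names in call order; len() is the batch length
    min_calls: int = 1       # assumed field: fewest calls that can solve the task


def over_call_rate(results: list[TaskResult]) -> dict[str, float]:
    """Per config, the fraction of tasks where more calls were issued than needed."""
    totals: dict[str, int] = {}
    overs: dict[str, int] = {}
    for r in results:
        totals[r.config] = totals.get(r.config, 0) + 1
        if len(r.actual_tools) > r.min_calls:
            overs[r.config] = overs.get(r.config, 0) + 1
    return {cfg: overs.get(cfg, 0) / n for cfg, n in totals.items()}


# Made-up sample rows (not run data).
sample = [
    TaskResult("config_a", "single_clause_costtrap_01", ["search"]),
    TaskResult("config_a", "tri_counterparty_parallel_01", ["search"] * 3, min_calls=3),
    TaskResult("config_b", "single_clause_costtrap_01", ["list_documents", "search"]),
    TaskResult("config_b", "tri_counterparty_parallel_01", ["search"] * 4, min_calls=3),
]
rates = over_call_rate(sample)
print(rates)
```

Keeping this as a reported metric rather than a score component matches the V1 stance that scoring stays binary.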

Axis D — Convert the dead search_history distractor into a live trap

Capability: distinguishing "search my prior interactions" from "read the conversation already in front of me." search_history drew zero picks, so it's wasted. A task that explicitly says "what we looked at earlier" makes search_history the tempting move, while the correct move is either to answer from in-context history or to re-issue a corpus search/get_document.

Why it's a real skill: in stateful agents, conversation context and a "session history" tool are genuinely confusable; picking the tool when the answer is already in the message window is a real and costly error.

Difficulty/spread: expect 6–10 / 12, with weaker/cheaper models more likely to grab the shiny history tool. Resurrects a dead distractor into a discriminator.

Code changes: none.

Example (mid-chain, where the prior result is already in context):

{
  "task_id": "halverson_recall_01",
  "scenario_id": "halverson_dispute",
  "step": 6,
  "description": "User asks to re-confirm a figure from a document already fetched earlier in THIS conversation (doc_07, Section 9.2). The answer is in context; the correct retrieval move if a re-read is wanted is get_document(doc_07). search_history is the trap (it searches prior sessions, not this corpus and not this in-context history).",
  "messages": ["<<halverson history through step 5; user now asks 'remind me exactly what the 9.2 cap said in the Halverson MSA we pulled earlier'>>"],
  "expected": {
    "tool": "get_document",
    "args": {"doc_id": {"type": "exact", "value": "doc_07"}}
  },
  "scoring_weights": {}
}

Watch distractor_picks for search_history to confirm the bait now fires.
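
A quick check that the bait fires could look like the sketch below. It assumes call logs are available as a mapping from config name to the tool names that config called on the task; the function name mirrors the report's distractor_picks metric but is not the harness implementation, and the sample logs are made up.

```python
from collections import Counter

# Distractor tools named in this report.
DISTRACTORS = {"summarize_document", "search_history"}


def distractor_picks(actual_tools_by_config: dict[str, list[str]]) -> Counter:
    """Count how many configs called each distractor at least once on a task."""
    picks: Counter = Counter()
    for tools in actual_tools_by_config.values():
        for tool in DISTRACTORS.intersection(tools):
            picks[tool] += 1
    return picks


# Made-up call logs for one bait task (not run data).
sample = {
    "config_a": ["search_history"],
    "config_b": ["get_document"],
    "config_c": ["search_history", "get_document"],
}
picks = distractor_picks(sample)
bait_fired = picks["search_history"] > 0
print(picks, bait_fired)
```

Counting configs rather than raw calls keeps one chatty model from inflating the pick rate.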

Axis E — Adversarial / ambiguous filter language (filter precision under linguistic noise)

Capability: mapping fuzzy human date/scope language onto exact filter boundaries. The current constraint-dense tasks use clean phrasing ("in 2024", "first quarter of 2025"). Real users say "since last spring", "the back half of last year", "before we renewed Easton", "everything newer than the Apex deal." The model must resolve these to start_date/end_date boundaries, and a one-day boundary error fails under exact matching.

Why it's a real skill: date-range resolution is the #1 filter bug in production search agents; Azure AI Search / Elastic range queries are unforgiving about boundaries.

Difficulty/spread: expect 3–7 / 12. Boundary-off-by-one and "did they mean inclusive?" splits the field hard.

Caveat: relative dates need a fixed "today" anchor. The system prompt already implies a present; pin it explicitly in the task's system message ("Today's date is 2026-06-12.") so the expected boundary is deterministic. Avoid genuinely ambiguous phrasings where two boundaries are both defensible; that becomes a mis-grade, not a discriminator. Prefer phrasings with one correct resolution.

Code changes: none (uses existing exact filter matching). Authoring discipline only.

Example:

{
  "task_id": "h2_2024_relative_01",
  "scenario_id": "h2_2024_relative",
  "step": 1,
  "description": "Relative date phrasing ('the back half of 2024') must resolve to start_date 2024-07-01 / end_date 2024-12-31. Date anchor pinned in system prompt. Exact-match on the boundary; an off-by-quarter or inclusive/exclusive slip fails.",
  "messages": [
    {"role": "system", "content": "<<standard system prompt>> Today's date is 2026-06-12."},
    {"role": "user", "content": "Give me every agreement we executed in the back half of 2024 — titles and dates only."}
  ],
  "expected": {
    "tool": "list_documents",
    "args": {"filters": {"type": "exact", "value": {"start_date": "2024-07-01", "end_date": "2024-12-31"}}}
  },
  "scoring_weights": {}
}
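
When authoring tasks like the one above, the expected boundaries can be derived rather than typed by hand. The sketch below computes inclusive half-year ranges and shows why an exclusive end date fails an exact comparison; `half_year_range` and `filters_match_exact` are illustrative helpers, and the equality check is a simplified stand-in for the scorer's exact matching.

```python
from datetime import date, timedelta


def half_year_range(year: int, half: int) -> tuple[str, str]:
    """Inclusive ISO start/end dates for the first (1) or second (2) half of a year."""
    if half not in (1, 2):
        raise ValueError("half must be 1 or 2")
    start = date(year, 1 if half == 1 else 7, 1)
    # End is the day before the next half starts, so month lengths are never hard-coded.
    next_start = date(year, 7, 1) if half == 1 else date(year + 1, 1, 1)
    end = next_start - timedelta(days=1)
    return start.isoformat(), end.isoformat()


def filters_match_exact(expected: dict, actual: dict) -> bool:
    """Simplified exact matching: any boundary difference fails."""
    return expected == actual


start, end = half_year_range(2024, 2)
expected = {"start_date": start, "end_date": end}

# A model that treats the end boundary as exclusive picks the following day.
exclusive_slip = {"start_date": "2024-07-01", "end_date": "2025-01-01"}

print(expected)
print(filters_match_exact(expected, exclusive_slip))
```

Deriving the dates from calendar arithmetic also gives the task author a quick check that a phrasing has exactly one defensible resolution.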

Axes considered and rejected (pressure-test)


4. Scoring evolution

All-or-nothing is still right for V1 and for every axis A–E above — each is authorable as a clean binary pass/fail. Specific notes:

  1. expected_parallel needs per-spec alternatives (small V1-eligible fix). §2 showed the 2-way parallel task over-fits on indemnification vs liability. Today a parallel spec's args are matched strictly; there's no expected_alternatives for parallel (and _parse_task actively forbids it). The cheap fix is authoring discipline: keep parallel keywords to the counterparty plus one robust topic token so there's only one reasonable phrasing. A code fix (allow each ExpectedCall in expected_parallel to carry alternative keyword sets) is a ~20-line scorer change but is optional and can wait. Recommend authoring-discipline for June-21.
  2. Cost-trap (Axis C) and over-call behavior want a graded signal, but that's V2. Partial credit is explicitly backlogged (design decision 4). For now, TaskResult.actual_tools already records batch length and call order, so over-call rate and decompose-invocation rate fall out as unscored behavioral metrics exactly like distractor-pick rate. Report them in the viz; don't change the score. If/when V2 adds partial credit, the natural shape is a calls-budget penalty (score = correct ? max(0, 1 - λ·(n_calls - min_calls)) : 0). Frame it for V2; don't build it now.
  3. vendor_renewal_decompose_01 is a scoring fix, not a new axis: add an expected_alternatives entry for the list_documents(type=vendor_agreement) first move, or rewrite the query to genuinely require decomposition (span two doc families with a cross-family comparison, like halverson_dispute_01 does). This is adjudication per design decision 10, and it should land before the next run regardless of which axes ship.
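
For framing the V2 discussion, the calls-budget penalty from the notes above can be sketched as follows. The default λ of 0.25 is an arbitrary illustration, and clamping the excess at zero (so that using fewer calls than min_calls earns no bonus) is an added assumption that the formula in the notes does not state.

```python
def budgeted_score(correct: bool, n_calls: int, min_calls: int,
                   lam: float = 0.25) -> float:
    """score = max(0, 1 - lam * (n_calls - min_calls)) if correct, else 0."""
    if not correct:
        return 0.0
    # Assumption: excess is clamped at 0 so the score never exceeds 1.
    excess = max(0, n_calls - min_calls)
    return max(0.0, 1.0 - lam * excess)


# Illustrative calls: a correct single-call answer, then progressively more over-calling.
for n in (1, 2, 4, 6):
    print(n, budgeted_score(True, n_calls=n, min_calls=1))
```

The choice of λ decides how many wasted calls a correct answer can absorb before it scores zero, so it would need to be set against real over-call data before being adopted.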

5. Prioritized next steps (toward ~June-21 post)

Ordered by discrimination-per-effort. Effort in rough half-day units; all are authoring-only unless noted.

| # | Action | Effort | Why first |
| --- | --- | --- | --- |
| 1 | Adjudicate vendor_renewal_decompose_01 (add list_documents alternative or rewrite). Loosen halverson_dispute_02 parallel keywords to counterparty-only. | 0.5d | Fixes two grading artifacts currently understating the dataset's fairness. Zero new code. Must-do before re-run. |
| 2 | Axis B: content-dependent multi-hop, replacing saturated easton_amendment_03 with 03b (and optionally retiring easton_amendment_01). | 0.5d | Highest discrimination-per-effort: mid-band, reuses an existing chain, kills dead weight while adding signal. Pure authoring. |
| 3 | Axis A: one 3-way parallel task (tri_counterparty_parallel_01). | 0.5d | Tests whether parallel 1/12 was the task or the axis; _score_parallel already handles N-way, so zero code. Strong top-end split. |
| 4 | Axis D: search_history bait task (halverson_recall_01). | 0.5d | Converts a dead distractor into a live one; cheap; gives the viz a new distractor-pick story. |
| 5 | Axis E: 1–2 relative-date filter tasks with pinned date anchor. | 0.5d | Good mid-band discriminator; only risk is authoring an ambiguous boundary, so keep phrasings single-resolution. |
| 6 | Axis C: one cost-trap task + report over-call rate from actual_tools in the viz. | 0.5d | Ties directly to the CPC north star; binary-scoreable in V1 via tool-name mismatch; behavioral metric is free. |
| 7 | (V2, not June-21) Abstention/clarification mode: new expected_no_call task type + scorer branch. | 1.5d+ | Real axis, but net-new code AND new task type. Backlog. |

June-21 scope: items 1–4 are the floor (2 days, all authoring, zero code) and already meaningfully sharpen the run. Items 5–6 if time allows. Item 7 is explicitly V2.

Re-run note: after items 1–4, re-run the full 12-config grid. Watch for (a) vendor_renewal_decompose no longer at floor, (b) the 3-way parallel landing 0–3/12 (validates the axis), (c) search_history finally drawing picks (validates Axis D), (d) easton chain correct/total spread widening now that 01/03 are gone or replaced.