Sandbox

Price the shape of a pull request.

A worked example: one model upgrade, priced by the engine. The example diff below is fixed and read-only; the report beside it is computed in your browser on page load by a JavaScript port of the Python cost_engine. The same method runs on your real diff in CI.

Example pull request

A fixed example, not editable. The engine reads the changed lines for model, max_tokens, retries, and fallbacks, then runs the paired Monte Carlo with common random numbers (CRN).

@@ app/agent.py: model upgrade @@
-        model="gpt-4o-mini",
-        max_tokens=1024,
+        model="gpt-4.1",
+        max_tokens=8192,
+        max_retries=5,
+        fallback_model="gpt-4o",

Workload basis: 100,000 requests/mo · 1 LLM call/request · 2,400 input tokens/call · $500 P90 budget.
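As a rough illustration of reading levers out of changed lines, the sketch below pulls model, max_tokens, max_retries, and fallback_model from the example diff with a regular expression. The function name read_levers and the pattern are hypothetical; the real engine's parser is not shown on this page and almost certainly handles more cases.

```python
import re

# Hypothetical pattern: a +/- diff line that assigns one of four known levers.
LEVER_PATTERN = re.compile(
    r'^(?P<sign>[+-])\s*'
    r'(?P<key>model|max_tokens|max_retries|fallback_model)'
    r'\s*=\s*(?P<value>"[^"]*"|\d+)'
)


def read_levers(diff_text):
    """Return (baseline, pr) dicts of levers found on removed and added lines."""
    baseline, pr = {}, {}
    for line in diff_text.splitlines():
        match = LEVER_PATTERN.match(line)
        if not match:
            continue  # hunk headers and unrelated lines are ignored
        raw = match.group("value")
        value = raw.strip('"') if raw.startswith('"') else int(raw)
        target = baseline if match.group("sign") == "-" else pr
        target[match.group("key")] = value
    return baseline, pr


EXAMPLE_DIFF = '''\
@@ app/agent.py: model upgrade @@
-        model="gpt-4o-mini",
-        max_tokens=1024,
+        model="gpt-4.1",
+        max_tokens=8192,
+        max_retries=5,
+        fallback_model="gpt-4o",
'''

baseline, pr = read_levers(EXAMPLE_DIFF)
print("baseline:", baseline)
print("pr:", pr)
```

Removed lines describe the baseline scenario and added lines describe the PR scenario, which is what makes the later comparison a paired one.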

How to read the example

The browser shares the engine's math; CI runs it on your actual diff.

The engine reads from the changed lines the levers that usually move AI cost: model swap, output ceiling (max_tokens), retry count, fallback path, and tool schemas. It then runs a paired baseline/PR Monte Carlo over 20,000 seeded months. The chart shows the distribution of the monthly cost delta, not an absolute bill.
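A minimal sketch of a paired Monte Carlo with common random numbers, assuming a toy cost formula. Everything here is illustrative: the prices are placeholders rather than entries from any price basis, the baseline retry limit of 2 is an assumption (the diff does not state it), and monthly_cost is a stand-in, not the engine's simulate().

```python
import math
import random

# Placeholder prices in dollars per million tokens; illustrative numbers only.
BASELINE = {"in_price": 0.20, "out_price": 0.80, "max_tokens": 1024, "max_retries": 2}
PR = {"in_price": 2.00, "out_price": 8.00, "max_tokens": 8192, "max_retries": 5}

REQUESTS_PER_MONTH = 100_000   # workload basis from the example
INPUT_TOKENS_PER_CALL = 2_400  # workload basis from the example


def monthly_cost(scenario, z_output, z_retry):
    """Toy cost of one simulated month (hypothetical formula)."""
    # Share of the output ceiling actually used: lognormal, capped at 100%.
    utilisation = min(1.0, 0.25 * math.exp(0.5 * z_output))
    output_tokens = scenario["max_tokens"] * utilisation
    # Extra attempts per request: lognormal, capped by the retry limit.
    attempts = 1.0 + min(float(scenario["max_retries"]), 0.05 * math.exp(z_retry))
    per_call = (INPUT_TOKENS_PER_CALL * scenario["in_price"]
                + output_tokens * scenario["out_price"]) / 1_000_000
    return REQUESTS_PER_MONTH * attempts * per_call


def simulate_delta(baseline, pr, months=20_000, seed=7):
    """Return one PR-minus-baseline cost delta per simulated month."""
    rng = random.Random(seed)  # seeded, so the run is reproducible
    deltas = []
    for _ in range(months):
        # Common random numbers: both scenarios see the same draws each month.
        z_output, z_retry = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        deltas.append(monthly_cost(pr, z_output, z_retry)
                      - monthly_cost(baseline, z_output, z_retry))
    return deltas


def percentile(values, q):
    """Simple empirical percentile by sorted rank."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


deltas = simulate_delta(BASELINE, PR)
print("mean delta:", round(sum(deltas) / len(deltas), 2))
print("P50 delta:", round(percentile(deltas, 0.50), 2))
print("P90 delta:", round(percentile(deltas, 0.90), 2))
```

Because both scenarios share each month's draws, noise that affects them equally cancels in the subtraction, so the delta distribution reflects the change itself rather than sampling luck.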

The browser engine ports the default-path pricing formula of the Python cost_engine simulate(). A CI parity test asserts that the demo's mean stays within 1.5% and its P90 within 5% of the reference engine on identical inputs, so the simulation math matches CI. The absolute dollars differ: CI prices against the full ~6,900-model frozen basis and builds scenarios from signals read out of your real files, and the production engine adds structured/cache pricing, capability-derived retries, rework, and cloud terms that this example does not exercise. What transfers is the method and the shape, not the exact numbers.
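The tolerance check behind such a parity test might look like the sketch below. The function names and the summary statistics are made up for illustration; only the 1.5% and 5% thresholds come from the text above, and the actual test file may be structured quite differently.

```python
def relative_gap(candidate, reference):
    """Absolute difference as a fraction of the reference value."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(candidate - reference) / abs(reference)


def parity_ok(port, reference, mean_tol=0.015, p90_tol=0.05):
    """True when the port's mean and P90 both sit inside their tolerances."""
    return (relative_gap(port["mean"], reference["mean"]) <= mean_tol
            and relative_gap(port["p90"], reference["p90"]) <= p90_tol)


# Made-up summary statistics, for illustration only.
reference_stats = {"mean": 1000.0, "p90": 1500.0}
close_port = {"mean": 1010.0, "p90": 1560.0}    # 1.0% and 4.0% off
drifted_port = {"mean": 1020.0, "p90": 1500.0}  # mean is 2.0% off

print(parity_ok(close_port, reference_stats))
print(parity_ok(drifted_port, reference_stats))
```

The P90 tolerance is looser than the mean tolerance, which is reasonable for a sampled tail statistic: percentiles far from the center are noisier than averages.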

The example lets a buyer feel the approval problem in thirty seconds: the same feature can be acceptable at expected cost and unacceptable at P90 after retries or output growth. That is the moment the budget gate becomes intuitive.

From browser to CI

The same engine, two surfaces.

The demo above is the browser port. In CI, the identical method runs on the real diff via estimate_pr: it reads the actual file versions, builds paired baseline and PR scenarios, runs the Python Monte Carlo with common random numbers, and writes the machine-readable gate payload. A parity test (tests/test_js_engine_parity.py) keeps the two honest.

python -m takeoff.estimate_pr \
  --base origin/main \
  --head HEAD \
  --json gate.json

What the CI run adds beyond this example

The browser example shares the engine math; production adds the rest.

Your real diff

CI reads the actual file versions through estimate_pr(file_versions) and scans them, instead of pricing this one fixed example.

The full price basis

CI prices against the frozen ~6,900-model basis and the model-fit recommendation, not the small illustrative table this example uses.

The whole cost model

CI applies structured/cache pricing, capability-derived retries, human rework, and cloud/fixed terms, none of which the browser example exercises.

The gate payload

CI writes the machine-readable policy JSON and posts the PR comment, including the estimate class, verdict, BOE, and footprint block.

Example questions

How evaluators should interpret the browser numbers.

Does any code get uploaded?

No. This is a fixed example computed entirely in your browser by a JavaScript port of the engine. The production path reads your real file versions through the Python takeoff engine in your own CI.

Should the curve be a normal distribution?

No. Class1 uses standard-normal draws as the random source, then maps P50/P90 risk factors into lognormal variables. Tokens, retries, context growth, and spend cannot go negative, and the business risk is the right tail.
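One common way to map a P50/P90 pair onto a lognormal variable is sketched below, assuming the P50 is the median and the P90 fixes the spread. The function names and the example factor (P50 of 1.0, P90 of 1.6) are hypothetical, and the engine's own mapping may differ in detail.

```python
import math
import random
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.90)  # standard-normal 90th percentile, about 1.2816


def lognormal_params(p50, p90):
    """Return (mu, sigma) of ln(X) so that X has the given median and P90."""
    if not 0 < p50 < p90:
        raise ValueError("need 0 < p50 < p90")
    mu = math.log(p50)
    sigma = (math.log(p90) - mu) / Z90
    return mu, sigma


def from_normal(z, p50, p90):
    """Map one standard-normal draw z to a lognormal value."""
    mu, sigma = lognormal_params(p50, p90)
    return math.exp(mu + sigma * z)


rng = random.Random(11)
draws = [from_normal(rng.gauss(0.0, 1.0), p50=1.0, p90=1.6) for _ in range(10_000)]

print("minimum draw:", min(draws))          # always positive
print("median input z=0 maps to:", from_normal(0.0, 1.0, 1.6))
print("z=Z90 maps to:", from_normal(Z90, 1.0, 1.6))
```

The exponential keeps every draw positive and stretches the upper side of the distribution, which is why the resulting curve is right-skewed rather than bell-shaped.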

Why is P90 usually much higher than P50?

The tail compounds systemic factors: output growth, retry pressure, context expansion, fallback rate, and demand spikes. Averages hide that approval risk.
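The compounding can be illustrated with a small calculation, assuming each factor is lognormal and the factors multiply. The five equal factors with a P90/P50 ratio of 1.3 are hypothetical. Two bounding cases are shown: fully independent factors, and factors driven by one shared shock.

```python
import math
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.90)  # standard-normal 90th percentile


def sigma_from_ratio(p90_over_p50):
    """Log-scale standard deviation implied by a factor's P90/P50 ratio."""
    return math.log(p90_over_p50) / Z90


def combined_ratio(ratios, correlated):
    """P90/P50 ratio of the product of lognormal factors."""
    sigmas = [sigma_from_ratio(r) for r in ratios]
    if correlated:
        # One shared shock moves every factor: log-sigmas add directly.
        total_sigma = sum(sigmas)
    else:
        # Independent factors: log-variances add, so sigmas add in quadrature.
        total_sigma = math.sqrt(sum(s * s for s in sigmas))
    return math.exp(Z90 * total_sigma)


# Hypothetical: five factors, each with P90 only 30% above its P50.
FACTOR_RATIOS = [1.3] * 5

independent = combined_ratio(FACTOR_RATIOS, correlated=False)
systemic = combined_ratio(FACTOR_RATIOS, correlated=True)
print("independent factors, P90/P50:", round(independent, 2))
print("fully correlated factors, P90/P50:", round(systemic, 2))
```

Even with modest per-factor spread, the combined P90 sits well above the combined P50, and the gap widens sharply when the factors move together.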

What should I do when the sandbox fails the budget?

Treat it like the real gate: reduce output, cap retries, shrink context, lazy-load tools, choose a better-fit model, or bring in the budget owner.

What makes the real CI run stronger?

CI uses the actual diff, paired baseline/PR scenarios, common random numbers, the frozen price basis, model-fit recommendation, BOE, footprint block, and machine-readable gate payload instead of this simplified browser setup.