We ran 21,000 agentic simulations. A blend of open-source models matched the top performers.
Most teams build around one model. Find the best, lock in, optimize around it. We tested the opposite: what happens when you route across multiple cost-efficient open-source models instead?
On over 21,000 simulations of one of the hardest agentic benchmarks available, that blend scored 90.9%, within 7 points of the single best model and ahead of every other model family we tested.
What we ran and what we found
Tau2-Bench from Sierra Research is a multi-turn agentic benchmark where models act as autonomous customer-service agents. They look up accounts, process payments, chain multiple tool calls in sequence, and follow a 5,000-word policy doc across full conversations. Binary pass/fail. Every candidate ran 10 complete passes across all 114 tasks (1,140 simulations each).
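To make the evaluation setup concrete, here is a minimal sketch of how a score like this is aggregated: 10 full passes over all 114 tasks, each simulation graded binary pass/fail, with the success rate as the fraction of passing simulations. The `run_simulation` stub and its pass probability are illustrative stand-ins, not the benchmark harness itself.

```python
# Sketch of Tau2-Bench-style aggregation: 10 passes x 114 tasks,
# binary pass/fail per simulation. run_simulation is a placeholder.

import random

N_TASKS = 114
N_PASSES = 10

def run_simulation(task_id: int, pass_idx: int) -> bool:
    """Stand-in for one full agentic conversation; returns pass/fail."""
    random.seed(task_id * 1000 + pass_idx)  # deterministic for the sketch
    return random.random() < 0.9            # stand-in for a ~90% model

results = [
    run_simulation(task, p)
    for p in range(N_PASSES)
    for task in range(N_TASKS)
]
success_rate = sum(results) / len(results)
print(f"{len(results)} simulations, success rate {success_rate:.1%}")
```

With 10 passes per candidate, each row in the table below reflects 1,140 graded conversations, which smooths out run-to-run variance in multi-turn agent behavior.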
One of the candidates we tested is Text Prime, a standardized instrument on The Grid designed for high-quality reasoning and complex tasks.
Instead of choosing a specific model from a specific provider, you purchase Units of the instrument. Each instrument defines a performance specification, including minimum intelligence, latency, and throughput requirements.
Multiple models and suppliers can serve requests for that instrument as long as they meet the specification. The system routes requests to qualifying suppliers behind the API, so developers interact with a consistent performance tier rather than managing individual models.
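The routing logic described above can be sketched as a filter-then-select step: keep only suppliers that meet the instrument's specification, then pick one to serve the request. The `Supplier` fields, thresholds, and selection policy here are illustrative assumptions, not The Grid's actual implementation.

```python
# Minimal sketch of spec-based routing: any supplier meeting the
# instrument spec may serve the request. All fields are illustrative.

from dataclasses import dataclass

@dataclass
class Supplier:
    name: str
    quality: float          # e.g. benchmark score, 0-1
    p95_latency_ms: float
    throughput_tps: float

@dataclass
class InstrumentSpec:
    min_quality: float
    max_p95_latency_ms: float
    min_throughput_tps: float

    def qualifies(self, s: Supplier) -> bool:
        return (s.quality >= self.min_quality
                and s.p95_latency_ms <= self.max_p95_latency_ms
                and s.throughput_tps >= self.min_throughput_tps)

def route(spec: InstrumentSpec, suppliers: list[Supplier]) -> Supplier:
    pool = [s for s in suppliers if spec.qualifies(s)]
    if not pool:
        raise RuntimeError("no supplier meets the instrument spec")
    # Illustrative policy: lowest latency wins; a real router could
    # also weight by price, load, or measured quality.
    return min(pool, key=lambda s: s.p95_latency_ms)

spec = InstrumentSpec(min_quality=0.8, max_p95_latency_ms=1200,
                      min_throughput_tps=40)
suppliers = [
    Supplier("a", 0.95, 900, 60),
    Supplier("b", 0.70, 400, 80),   # fails min_quality, excluded
    Supplier("c", 0.85, 700, 50),
]
chosen = route(spec, suppliers)
print(chosen.name)
```

The key property is that the caller never names a model: any deployment that clears the spec is interchangeable behind the API.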
We wanted to see how that approach holds up against individual top models on a hard agentic benchmark.
Rank   Model                Provider            Success Rate
----   ------------------   -----------------   ------------
1      GLM-4.7              Fireworks           97.9%
2-5    GLM-4.7              4 other providers   94.3–97.8%
6      Text Prime (blend)   6 providers         90.9%
7      Kimi-K2.5            Fireworks           86.1%
8      Kimi-K2.5            DeepInfra           84.4%
9      Kimi-K2.5            Together AI         83.2%
10     Kimi-K2 Thinking     Novita              82.5%
...    ...                  ...                 ...
19     MiniMax-M2P1         Fireworks           7.6%

Text Prime is a weighted blend of frontier open-source models routed across 6 inference providers. Some of those individual models score in the low 80s on their own. Blended and weighted toward quality, the ensemble hits 90.9%, ahead of every Kimi-K2.5 deployment and every other model family tested.
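One way a quality-weighted blend like this could work is to derive routing weights from each deployment's measured quality, so stronger members serve a larger share of traffic. The deployment names, scores, and weighting scheme below are made-up illustrations, not Text Prime's actual composition.

```python
# Sketch of quality-weighted routing across blend members.
# Deployment names and scores are invented for illustration.

import random

deployments = {
    "model-a@provider-1": 0.95,   # hypothetical measured quality
    "model-b@provider-2": 0.84,
    "model-c@provider-3": 0.82,
}

def routing_weights(quality: dict[str, float],
                    power: float = 4.0) -> dict[str, float]:
    """Skew traffic toward quality by raising scores to a power."""
    raw = {k: v ** power for k, v in quality.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

def pick_deployment(weights: dict[str, float]) -> str:
    names, w = zip(*weights.items())
    return random.choices(names, weights=w, k=1)[0]

weights = routing_weights(deployments)
chosen = pick_deployment(weights)
```

The intuition is that an ensemble can beat its average member: weighting toward the strongest deployments pulls the blended score above what the weaker members would deliver alone.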
What this means
Model performance is commoditizing at the top.
Look at the table. GLM-4.7 scores between 94% and 98% across five different providers. Kimi-K2.5 scores between 82% and 86%. Text Prime, a routed blend, sits at 90.9%. The top 10 of 19 configurations all score above 82% on a benchmark that involves tool chaining, policy adherence, and multi-turn state tracking. The frontier isn't one model pulling away from the pack. It's a cluster of capable models in the same performance neighborhood.
Open-source is ready for hard agentic workloads.
Every model inside Text Prime is open-source. A 90.9% success rate on binary-graded agentic tasks, the kind that require multiple sequential tool calls and strict policy compliance, puts open-source firmly in production territory. The gap between open-source and proprietary on real-world agent tasks is no longer a question of "if" but "how small."
When quality converges, the performance vs. cost trade-off becomes the real decision.
Agents don't make one inference call. They loop. They reason, call tools, re-read context, and retry. A single ticket resolution in this benchmark runs 5 to 15+ LLM calls. When the performance difference between a premium single model and a cost-efficient blend is 7 points on the hardest tasks, but the cost difference compounds with every call your agent makes, the trade-off becomes clear.
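A back-of-envelope calculation shows how the per-call cost difference compounds across an agent loop. The prices and volumes here are illustrative assumptions, not measured figures from the benchmark or real provider pricing.

```python
# Hypothetical cost compounding across an agent loop. All dollar
# figures and volumes are assumptions for illustration only.

calls_per_ticket = 10          # benchmark tickets run 5 to 15+ LLM calls
tickets_per_day = 50_000

premium_cost_per_call = 0.020  # hypothetical premium single model, $
blend_cost_per_call = 0.004    # hypothetical cost-efficient blend, $

def daily_cost(cost_per_call: float) -> float:
    return cost_per_call * calls_per_ticket * tickets_per_day

premium = daily_cost(premium_cost_per_call)
blend = daily_cost(blend_cost_per_call)
print(f"premium ${premium:,.0f}/day vs blend ${blend:,.0f}/day")
```

Under these assumed numbers, a 5x per-call price gap becomes a five-figure daily difference at volume, while the quality gap stays in single digits.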
If the performance gap at the top is single digits and shrinking, the teams that win aren't the ones chasing the last few points on a benchmark. They're the ones optimizing the cost of reliable, high-quality inference at volume.
That's exactly what we're building at The Grid, a spot market for AI inference.
Instead of buying access to a specific model from a specific provider, you buy Units of standardized Instruments like Text Prime that guarantee a minimum level of intelligence, latency, and throughput. Units are fungible across suppliers and models, as long as they meet the spec. Prices are formed through real-time supply and demand, and multiple suppliers compete to serve your workloads.
You consume inference through the standard chat completions API, the same way you already call any model today. The point isn't picking winners in a model race; it's treating inference as the commodity it's becoming.
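Since the interface is the standard chat completions shape, a request differs from an ordinary model call only in what goes in the `model` field. The sketch below builds that request body; using `"text-prime"` as the identifier is an assumption for illustration, while the message structure is the standard chat completions format.

```python
# Sketch of a chat completions request body with an instrument in the
# "model" field. "text-prime" as the identifier is an assumption.

import json

payload = {
    "model": "text-prime",  # the instrument tier, not a specific model
    "messages": [
        {"role": "system", "content": "You are a customer-service agent."},
        {"role": "user", "content": "Why was my card charged twice?"},
    ],
}

body = json.dumps(payload)
```

Because the request shape is unchanged, existing OpenAI-compatible clients and agent frameworks should work without modification; only the endpoint and model identifier change.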