TL;DR

Developer @stevibe posted three informal runs of ollama run kimi-k2.6:cloud, hitting 77.9, 114.3, and 86.3 tok/s with TTFT under 1.2 seconds. Same model (Moonshot AI's 1T-parameter Kimi K2.6) served through OpenRouter caps at 71 tok/s on Cloudflare, and drops to 14 tok/s on Parasail. Small sample, real signal: Ollama's cloud proxy is currently the fastest public way to run K2.6.

What's new

Kimi K2.6 went GA on April 20, 2026, landing simultaneously on Kimi.com, the official API, the Kimi Code CLI, Cloudflare Workers AI, OpenRouter, and Ollama's cloud registry. Ollama's twist: you keep the local CLI you already know and just append :cloud to the tag. The model runs remote, your laptop stays cool, and the request path looks identical to a local ollama run call.

That convenience is the pitch. The new data point is that the convenience layer is also, right now, the fastest path.

The numbers

Three back-to-back runs against kimi-k2.6:cloud:

| Run | Throughput | TTFT |
| --- | --- | --- |
| 1 | 77.9 tok/s | 979 ms |
| 2 | 114.3 tok/s | 788 ms |
| 3 | 86.3 tok/s | 1117 ms |

Mean ≈ 93 tok/s, TTFT ≈ 960 ms. The variance is real (a ≈36 tok/s spread), which tracks with how any shared inference backend behaves under variable load.
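A quick sanity check on those summary figures; the run values come straight from the table above, everything else is plain arithmetic:

```python
# Sanity-check the summary stats from the three reported runs.
throughput = [77.9, 114.3, 86.3]   # tok/s
ttft_ms = [979, 788, 1117]         # ms

mean_tps = sum(throughput) / len(throughput)
mean_ttft = sum(ttft_ms) / len(ttft_ms)
spread = max(throughput) - min(throughput)

print(f"mean throughput ≈ {mean_tps:.1f} tok/s")   # ≈ 92.8
print(f"mean TTFT       ≈ {mean_ttft:.0f} ms")     # ≈ 961
print(f"spread          ≈ {spread:.1f} tok/s")     # ≈ 36.4
```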

Comparison vs OpenRouter providers

Same K2.6 weights (in theory), very different serving stacks:

| Provider | Throughput (tok/s) | Gap vs Ollama mean |
| --- | --- | --- |
| Ollama kimi-k2.6:cloud | 77.9–114.3 (≈93 avg) | baseline |
| Cloudflare Workers AI | 71 | ~24% slower |
| Moonshot AI (official) | 27 | ~3.4× slower |
| NovitaAI | 27 | ~3.4× slower |
| Parasail | 14 | ~6.6× slower |

Cloudflare is the only OpenRouter provider in the same order of magnitude. The rest sit 3–7× behind. Most surprising slot: Moonshot AI's own endpoint at 27 tok/s — the model's creators being outpaced by a third-party proxy is not the story anyone expected.

What Kimi K2.6 actually is

For context on what everyone is racing to serve: Kimi K2.6 is Moonshot AI's latest Mixture-of-Experts model, with 1 trillion total parameters but only 32 billion active per forward pass. That's why providers can push 70–100+ tok/s on it — compute cost at inference scales with the active set, not the total. The 1T pool gives the model its knowledge capacity; the 32B active keeps it moving.
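A rough back-of-the-envelope view of why active parameters, not total, drive decode speed. The parameter counts come from the model description above; the "~2 FLOPs per active parameter per token" rule is a standard approximation, not a published figure for K2.6:

```python
# Approximate per-token decode cost: dense use of all weights vs MoE routing.
# Rule of thumb (approximation): ~2 FLOPs per active parameter per generated token.
TOTAL_PARAMS = 1_000_000_000_000   # 1T total parameters (K2.6)
ACTIVE_PARAMS = 32_000_000_000     # 32B active per forward pass

flops_if_dense = 2 * TOTAL_PARAMS  # if every parameter were touched per token
flops_moe = 2 * ACTIVE_PARAMS      # only the routed experts participate

print(f"dense-equivalent: {flops_if_dense / 1e9:,.0f} GFLOPs per token")  # ~2,000
print(f"MoE (K2.6):       {flops_moe / 1e9:,.0f} GFLOPs per token")       # ~64
print(f"per-token compute ~{flops_if_dense / flops_moe:.0f}x cheaper")    # ~31x
```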

It's a native multimodal agentic model (vision + text in, text out), built for long-horizon coding, UI/UX generation from design intent, and multi-agent swarm orchestration. Published benchmarks from the K2 family land at 65.8% on SWE-Bench Verified and 53.7% on LiveCodeBench — coding-heavy workloads are the target audience, and those are exactly the workloads where tok/s directly shapes developer experience.

Why it matters

For agentic coding and long-context workflows, throughput is the difference between a usable loop and a frustrating one. Kimi K2.6 is tuned for SWE-Bench-style tasks and ships with a 262,144 token context window — the kind of setup where you want tokens flowing fast, because a single multi-file review can easily emit 5–10K output tokens.

At 14 tok/s (Parasail), a 5K response is ~6 minutes. At 93 tok/s (Ollama), it's ~55 seconds. That's the difference between a tool you use and a tool you tab away from.
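The same arithmetic generalized, so you can plug in your own response sizes. The throughput figures are the ones reported above; the 5K-token response size is illustrative:

```python
# Wall-clock wait for a response of a given size at a given decode speed.
def wait_seconds(output_tokens: int, tok_per_s: float) -> float:
    return output_tokens / tok_per_s

providers = {"Ollama cloud (mean)": 93, "Cloudflare": 71, "Moonshot": 27, "Parasail": 14}
for name, tps in providers.items():
    secs = wait_seconds(5_000, tps)   # a 5K-token multi-file review
    print(f"{name:22s} {secs / 60:4.1f} min for 5K output tokens")
# Ollama ≈ 0.9 min, Cloudflare ≈ 1.2 min, Moonshot ≈ 3.1 min, Parasail ≈ 6.0 min
```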

Caveats — read these before you believe the numbers

  • N = 3. Three samples per endpoint isn't a benchmark, it's a vibe check.
  • Time-of-day effect. Cloud inference backends batch aggressively; your 77 tok/s run and my 114 tok/s run depend on who else is queued when we hit send.
  • Quantization unknown. No provider publishes the quant level per deployment. An fp8 host will beat an fp16 host on tok/s while shipping slightly different outputs. We don't know which dial each provider turned.
  • Prompt shape matters. Short prompts with low input/output ratios favor providers with low fixed overhead. Long-context calls (near the 256K ceiling) would reorder this leaderboard.
  • Geography. @stevibe's network path to Ollama may differ dramatically from yours.

Pricing context

OpenRouter lists K2.6 at $0.60 / $2.80 per million tokens (input/output). Ollama's cloud pricing is tied to an Ollama account rather than OpenRouter's marketplace, so direct cost-per-tok/s comparisons depend on which plan you're on. If you're optimizing cost-per-second-of-wait, Cloudflare and Ollama are the two endpoints worth timing yourself.
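For a sense of scale, here's what a single agentic coding turn costs at OpenRouter's listed rates. Only the per-million prices come from the listing above; the token counts are illustrative:

```python
# Cost of one request at OpenRouter's listed K2.6 rates.
INPUT_PER_M = 0.60    # USD per million input tokens (OpenRouter listing)
OUTPUT_PER_M = 2.80   # USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Illustrative: a 50K-token multi-file context producing a 5K-token review.
print(f"${request_cost(50_000, 5_000):.3f} per request")   # ≈ $0.044
```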

How to try it yourself

If you want to repro the benchmark:

ollama run kimi-k2.6:cloud
>>> /set verbose
>>> Write a 500-word essay on distributed systems.

Ollama's verbose mode prints eval rate (tok/s) and prompt eval duration (TTFT proxy) after each response. Run it 5–10 times at different hours and you'll have a more honest distribution than any single tweet.
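If you'd rather automate the repetition, here's a minimal sketch. It assumes `ollama run <model> "<prompt>" --verbose` prints its timing stats, including an "eval rate:" line, to stderr; the exact labels can vary between Ollama versions, so adjust the parsing if needed:

```python
# Sketch: run the same prompt several times and collect the generation eval rate.
import re
import subprocess

MODEL = "kimi-k2.6:cloud"
PROMPT = "Write a 500-word essay on distributed systems."
rates = []

for _ in range(5):
    result = subprocess.run(
        ["ollama", "run", MODEL, PROMPT, "--verbose"],
        capture_output=True, text=True,
    )
    # Match the generation "eval rate:" line, not "prompt eval rate:".
    match = re.search(r"^eval rate:\s+([\d.]+)", result.stderr, re.MULTILINE)
    if match:
        rates.append(float(match.group(1)))

print("runs:", rates)
if rates:
    print(f"mean ≈ {sum(rates) / len(rates):.1f} tok/s")
```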

What's next

Moonshot has already teased a K2.6 Code Preview on the Kimi blog, positioned as the successor coding model. If it lands on Ollama Cloud with similar throughput characteristics, the calculus for developers picking a provider gets very simple: pick the one with the fastest serving stack, because the weights are the same everywhere.

Sources: @stevibe on X, Ollama kimi-k2.6:cloud, OpenRouter K2.6, Cloudflare Workers AI.