One GPU, Ten Developers, $10/Month Each: Inside sllm’s Shared Inference Gamble

Running open-source models on cloud GPUs costs real money. A single H100 on-demand runs $2-7/hour depending on the provider, and a model the size of DeepSeek V3 (685B parameters) needs several of them; dedicate a node to serving it and you’re looking at roughly $14,000/month. Even smaller setups for 70B-class models land in the $500-2,000/month range. For individual developers and small teams who want to run open models themselves, that’s a brutal number.
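
To see roughly where that monthly figure comes from, here’s a quick back-of-the-envelope check. The 8-GPU node size and the ~$2.40/GPU-hour rate are assumptions on my part, picked from the low end of the on-demand range above, not numbers published by sllm or any specific provider.

```python
# Back-of-the-envelope cost check for the figures above.
# Assumptions (mine, for illustration): an 8x H100 node at ~$2.40/GPU-hour,
# the low end of the quoted $2-7/hour on-demand range.
GPUS_PER_NODE = 8
USD_PER_GPU_HOUR = 2.40
HOURS_PER_MONTH = 730  # roughly 24 * 365 / 12

monthly_node_cost = GPUS_PER_NODE * USD_PER_GPU_HOUR * HOURS_PER_MONTH
print(f"Dedicated 8x H100 node: ~${monthly_node_cost:,.0f}/month")   # ~$14,016

# A single H100 running around the clock, for comparison:
print(f"Single H100 at $2/hr: ~${2 * HOURS_PER_MONTH:,.0f}/month")   # ~$1,460
print(f"Single H100 at $7/hr: ~${7 * HOURS_PER_MONTH:,.0f}/month")   # ~$5,110
```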

sllm’s answer: don’t rent a GPU alone. Split it.

The platform launched on Hacker News as a Show HN post on April 4th and pulled 132 points — a strong signal that GPU cost pain is universal enough to make people pay attention to a completely unproven model.

How the Co-Op Model Works

The concept is dead simple. You pick a model, pay $10-40/month, and get assigned to a “cohort” of 5-10 developers sharing the same GPU node. The backend runs vLLM with continuous batching to juggle concurrent requests. The API is OpenAI-compatible — swap your base URL, keep everything else the same.
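
In practice, “OpenAI-compatible” means switching over should look roughly like the sketch below. The base URL and model name are placeholders I made up, not confirmed values from sllm’s documentation.

```python
# Minimal sketch of pointing an existing OpenAI client at an
# OpenAI-compatible endpoint. The base_url and model name are
# placeholders -- check sllm's own docs for the real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-sllm-endpoint.com/v1",  # hypothetical endpoint
    api_key="YOUR_SLLM_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v3",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Summarize continuous batching in one sentence."}],
)
print(response.choices[0].message.content)
```

Anything built on the official OpenAI SDK, or on tooling that lets you override the base URL, should work the same way for the chat completions endpoint.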

What you get:

  • Throughput: 15-35 tokens/second (varies by load and model size)
  • TTFT (time to first token): average under 2 seconds, worst case 10-30 seconds during peak
  • Privacy: No traffic logging. Your prompts stay your prompts.
  • Billing: Flat monthly fee. No per-token metering, no surprise bills.

Users are matched by timezone so that everyone’s peak hours don’t land at the same time. It’s the same logic behind gym memberships — the business model only works because not everyone shows up at 6 PM.

Supported models include Llama-4-Scout (109B), Qwen-3.5-122B, GLM-5-754B, Kimi-K2.5-1T, DeepSeek V3, and DeepSeek R1. The bigger the model, the bigger the cohort needed and the higher the monthly price. DeepSeek R1 runs $40/month. DeepSeek V3 requires cohorts of roughly 465 users to make the economics work.

The Pricing Math — and Where It Gets Tricky

At face value, the savings look enormous. Let’s compare:

Option                        Monthly Cost    What You Get
AWS H100 (on-demand)          ~$5,000+        Dedicated, full control
Lambda Labs H100              ~$2,100         Dedicated, ML-focused
Vast.ai / Spot                ~$1,500-2,000   Variable availability
OpenRouter / DeepInfra (API)  Usage-based     Pay per token, scales with demand
sllm                          $10-40          Shared, throughput-limited

The 50-100x price reduction is real, but it comes with asterisks.

First: cohorts have to fill before they activate. As of launch day, no cohorts had filled yet. The founder announced a 7-day cancellation window if your cohort doesn’t reach critical mass — so you won’t get stuck paying for nothing. But it means you might sign up, wait a week, and get refunded instead of getting GPU access.

Second: throughput is shared, not guaranteed. vLLM’s scheduler does best-effort allocation, not hard resource partitioning. If four people in your cohort decide to run heavy batch jobs at the same time, everyone’s tokens/second drops. This is the classic “noisy neighbor” problem that haunts every shared infrastructure play.
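
A rough way to picture it: treat the node as a fixed pool of tokens per second and split it among whoever is generating at that moment. The aggregate figure below is an assumption I chose so the numbers line up with the throughput range quoted above, not a published spec.

```python
# Toy model of the noisy-neighbor effect: a fixed aggregate throughput
# shared (best-effort) among everyone generating at the same time.
# The 3,000 tok/s aggregate is an assumed figure for illustration only.
NODE_AGGREGATE_TOKENS_PER_S = 3_000

for active_users in (10, 50, 100, 150, 200):
    per_user = NODE_AGGREGATE_TOKENS_PER_S / active_users
    print(f"{active_users:>3} users generating at once -> ~{per_user:5.1f} tok/s each")
```

With that assumed pool, 150 simultaneous users lands at the ~20 tokens/second figure the HN commenters used, and 200 drops you to the bottom of the advertised 15-35 range.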

Third: the per-token economics might not be as dominant as they look. HN commenters did the math — at 465 users sharing a DeepSeek V3 node at 20 tokens/second average, the system realistically supports about 150 concurrent users. If you’re a light user who sends a few hundred requests a day, sllm is a steal. If you’re running an agent loop that hammers the API 24/7, the shared throughput ceiling becomes a real constraint. At that point, per-token API pricing on OpenRouter or Together AI might actually be cheaper.
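
Whether the flat fee actually wins depends on how many tokens you push. Here is a hedged sketch using an assumed blended per-token rate rather than any specific provider’s price list:

```python
# Break-even sketch: flat monthly fee vs. per-token API pricing.
# The $1.00 per million tokens blended rate is an assumption for
# illustration; real aggregator prices vary by model and provider.
FLAT_FEE_USD = 40.0                # sllm's top-tier monthly price
API_USD_PER_MILLION_TOKENS = 1.00  # assumed blended input+output rate

break_even_tokens = FLAT_FEE_USD / API_USD_PER_MILLION_TOKENS * 1_000_000
print(f"Break-even: ~{break_even_tokens / 1e6:.0f}M tokens/month "
      f"(~{break_even_tokens / 30 / 1e6:.1f}M tokens/day)")

# Ceiling check: at a sustained 20 tok/s, one user cannot pull more than
# ~52M tokens/month no matter how hard the agent loop runs.
max_tokens_per_month = 20 * 3600 * 24 * 30
print(f"One user's ceiling at 20 tok/s: ~{max_tokens_per_month / 1e6:.0f}M tokens/month")
```

Under those assumptions the flat fee pays for itself somewhere around a million tokens a day of steady use, and the shared throughput ceiling caps how far past that a single subscriber can realistically go.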

What Hacker News Actually Thinks

The HN thread is a masterclass in the community doing back-of-napkin infrastructure math in real time.

The bulls see sllm solving a genuine gap. Cloud GPU pricing is designed for companies, not individuals. API providers charge per token, which scales linearly with usage. There’s no “Netflix for inference” — a flat fee for unlimited access to open-source models. sllm is the closest thing to that idea, and for hobbyists, researchers, and developers building prototypes, $10-40/month for unlimited tokens is an obvious yes.

The bears point to execution risk. The cohort model creates a chicken-and-egg problem: the service doesn’t work until enough people sign up, but people won’t sign up for a service that doesn’t work yet. The “no hard allocation guarantees” issue is a dealbreaker for anyone who needs predictable latency. And the economics scale awkwardly — small models need fewer users per cohort (easier to fill), but large models like DeepSeek V3 need hundreds of users per cohort (much harder to fill, and the service is most valuable for exactly those expensive models).

The founder (@jrandolf) made a fair counterpoint: most developers don’t actually use GPUs 24 hours a day. You send a prompt, read the output, think, edit your code, send another prompt. Actual GPU utilization per user is bursty, not continuous. Combined with timezone-based matching, the real concurrent load is a fraction of the total cohort size.

It’s a reasonable argument. It’s also the exact same argument every shared-resource startup has ever made — from co-working spaces to shared scooters. Sometimes it works. Sometimes the unit economics fall apart at scale.
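
For what it’s worth, the utilization argument is easy to sanity-check with a toy model. The 5% duty cycle below is my assumption, not a measured figure from sllm:

```python
# Toy utilization model behind the "gym membership" argument.
# Assumptions (mine): a 465-user cohort, each user actively generating
# ~5% of the time, with timezone matching spreading peaks out.
cohort_size = 465
active_fraction = 0.05  # assumed duty cycle per user

expected_concurrent = cohort_size * active_fraction
print(f"Expected concurrent users: ~{expected_concurrent:.0f}")            # ~23

# Even a 3x burst above that average stays well under the ~150-user
# ceiling the HN commenters estimated for a DeepSeek V3 node.
print(f"3x burst above average: ~{expected_concurrent * 3:.0f} users")     # ~70
```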

sllm vs. the Alternatives

The competitive landscape depends entirely on your usage pattern:

If you use < 1M tokens/day: sllm at $10-40/month crushes everything. OpenRouter would charge you based on model-specific per-token rates. Together AI, DeepInfra, and Fireworks all bill by usage. For light-to-moderate workloads on open-source models, sllm’s flat fee is unbeatable — if the cohort fills.

If you use 10M+ tokens/day: You’re probably better off with a dedicated GPU on Lambda Labs or Vast.ai, or negotiating volume pricing with an API provider. sllm’s shared throughput ceiling will bottleneck you before the monthly savings matter.

If you need guaranteed latency: sllm is not for you. The 10-30 second worst-case TTFT is unacceptable for production applications. Stick with dedicated inference or providers like Fireworks that offer latency SLAs.

If you want the widest model selection: OpenRouter gives you 400+ models from 60+ providers through one API. sllm currently supports a handful of open-source models. The breadth isn’t comparable.

The real competitor isn’t cloud GPUs or API providers — it’s the question of whether the “gym membership” model works for GPU compute. Gyms work because treadmills are expensive, usage is bursty, and most people skip Mondays. GPU inference has similar dynamics. The open question is whether sllm can reach the critical mass needed to keep cohorts filled and throughput acceptable.

For $10/month, it’s a low-risk bet for any developer who wants unlimited access to Llama-4 or Qwen-3.5 without watching a token counter tick up. The worst case is a refund. The best case is that shared GPU economics actually work — and the entire “pay per token” model starts looking like the overpriced default it might be.

