Most open-source models pick a lane: either they chase benchmark scores, or they minimize content restrictions. Nous Research is betting it can do both at the same time. Hermes 4, their latest open-weight model family spanning 14B to 405B parameters, posts competitive math and reasoning scores while achieving the lowest refusal rate of any high-performance model tested. The 405B variant hits 96.3% on MATH-500 and scores 57.1% on RefusalBench — more than triple GPT-4o’s 17.67%.
That combination has made Hermes 4 one of the most debated releases in the open-source AI community since it launched in late August 2025.
What Hermes 4 Actually Is
Hermes 4 is a family of post-trained models built on Meta’s Llama 3.1 checkpoints, available in three sizes: 14B, 70B, and 405B parameters. The key technical innovation is what Nous Research calls “hybrid reasoning” — users can toggle between fast, direct responses and a deeper deliberation mode where the model shows its thinking process inside <think>...</think> tags.
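The `<think>...</think>` tag format makes reasoning-mode output easy to post-process. Here is a minimal sketch for separating the deliberation from the final answer; the helper name and the sample string are my own, not part of any official Hermes API:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate chain-of-thought inside <think>...</think> from the final answer."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return "\n".join(t.strip() for t in thoughts), answer

raw = "<think>500 - 123 = 377</think>The answer is 377."
reasoning, answer = split_reasoning(raw)
print(answer)  # The answer is 377.
```

In fast mode the same parser is a no-op: with no tags present, the reasoning half comes back empty and the text passes through unchanged.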
This isn’t a new base model. It’s a post-training approach applied on top of existing Llama 3.1 weights. What makes it interesting is the scale of that post-training. The training corpus jumped from roughly 1 million samples and 1.2 billion tokens (Hermes 3) to approximately 5 million samples and 60 billion tokens — a 50x increase in data volume.
Two proprietary systems power this training pipeline:
- DataForge: A graph-based synthetic data generator inspired by AgentInstruct. It uses directed acyclic graphs (DAGs) where each node transforms data through a struct-to-struct mapping. Starting from pre-training seed data (DCLM, FineWeb), it can transform source material through multiple steps — turning a Wikipedia article into instruction-answer pairs, for instance. The system produced roughly 5 million samples totaling 19 billion tokens.
- Atropos: An open-source reinforcement learning framework that runs rejection sampling against approximately 1,000 task-specific verifiers. This creates a large corpus of verified reasoning trajectories, ensuring the model’s step-by-step thinking actually leads to correct answers rather than plausible-sounding nonsense.
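The rejection-sampling idea behind Atropos can be illustrated with a toy sketch. Everything here — the function names, the stand-in generator, the string-match verifier — is illustrative, not the actual framework API; the point is simply that only trajectories whose final answer passes a programmatic check are kept for training:

```python
import random

def verifier_math(answer: str, expected: str) -> bool:
    # Task-specific verifier: checks the final answer, not the reasoning prose.
    return answer.strip() == expected.strip()

def rejection_sample(generate, verify, n_candidates: int = 8):
    """Sample n candidates; keep only those whose answer passes the verifier."""
    kept = []
    for _ in range(n_candidates):
        trajectory = generate()
        if verify(trajectory["answer"]):
            kept.append(trajectory)
    return kept

# Toy stand-in for model sampling: some candidates are wrong on purpose.
random.seed(0)
def toy_generate():
    ans = random.choice(["377", "377", "380"])
    return {"reasoning": f"500 - 123 = {ans}", "answer": ans}

verified = rejection_sample(toy_generate, lambda a: verifier_math(a, "377"))
print(f"{len(verified)} of 8 candidate trajectories kept")
```

Scaled to ~1,000 verifiers across many task types, this filtering is how a large corpus of verified reasoning trajectories gets built.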
Nous Research published a 94-page technical report alongside the release, which is notably more transparent than what most open-weight model releases provide.
The Numbers: Benchmarks and RefusalBench
The headline benchmark for the 405B model in reasoning mode is 96.3% on MATH-500. That puts it in the same territory as frontier proprietary models on mathematical reasoning tasks.
But the number generating the most discussion is RefusalBench — a new benchmark Nous Research created specifically to measure how often models refuse to engage with prompts. The premise is straightforward: many requests that commercial models refuse are legitimate (creative writing with conflict, security research questions, hypothetical scenarios), and excessive refusal degrades utility.
Here’s how the models compare on RefusalBench:
| Model | RefusalBench Score |
|---|---|
| Hermes 4 405B (reasoning) | 57.1% |
| GPT-4o | 17.67% |
| Claude Sonnet 4 | 17.0% |
The gap is significant: Hermes 4 engages with 57.1% of RefusalBench prompts, more than three times the rate of either leading commercial model. The later Hermes 4.3 update pushed this even further, reaching 74.60% in non-reasoning mode.
Critics on Hacker News pointed out that Nous Research designed RefusalBench themselves, which raises obvious questions about benchmark shopping. But the underlying dataset has been published, and the open-source community has been running independent tests that broadly confirm the model’s willingness to engage with a wider range of prompts than its commercial counterparts.
The Hermes 4.3 Update: Smaller, Decentralized, Competitive
In December 2025, Nous Research followed up with Hermes 4.3, a 36B parameter model based on ByteDance’s Seed 36B architecture instead of Llama. The performance claim is bold: roughly equivalent to Hermes 4 70B at half the parameter count.
What makes Hermes 4.3 notable beyond raw performance is how it was trained. It’s the first production model post-trained entirely on the Psyche network — a distributed training infrastructure that uses the DisTrO optimizer to coordinate training nodes across data centers over the open internet, secured by Solana blockchain consensus. Whether decentralized training becomes a real paradigm shift or remains a niche experiment is an open question, but Hermes 4.3 is the first model to put real results behind the concept.
The 36B size also hits a sweet spot for local deployment. It’s small enough to run on consumer hardware with quantization, while large enough to be genuinely useful for complex tasks. For the r/LocalLLaMA crowd, this is where the practical excitement sits — a model they can actually run at home without a data center.
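A back-of-the-envelope check on those deployment claims: weight-only memory at a given quantization level is a straightforward calculation (this sketch ignores KV cache and runtime overhead, which add a few GB more in practice):

```python
def approx_weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint in GB; excludes KV cache and activations."""
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

for params in (14, 36, 70, 405):
    print(f"{params}B @ 4-bit ~= {approx_weight_memory_gb(params, 4):.1f} GB")
```

At 4-bit quantization the 36B model needs roughly 18 GB for weights alone — tight but feasible on a single 24GB consumer GPU — while the 70B needs ~35 GB and the 405B over 200 GB, consistent with the multi-GPU requirements described below.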
How It Compares to the Competition
Hermes 4 occupies a specific niche in the open-source model landscape. Here’s where it stands relative to key competitors:
vs. Llama 3.1 (base): Hermes 4 is built on Llama 3.1, so the comparison is direct — same architecture, extensive post-training on top. The math and reasoning improvements are substantial, and the reduced refusal rate is the primary differentiator.
vs. DeepSeek V3/R1: DeepSeek models generally offer stronger coding performance and newer architectures. Multiple Hacker News commenters noted that DeepSeek V3 and GLM 4.5 may outperform Hermes 4 on general tasks. The counterargument is that Hermes 4’s hybrid reasoning mode and minimal restrictions offer something DeepSeek doesn’t prioritize.
vs. GPT-4o / Claude: On pure capability, the 405B model is competitive but not clearly superior to frontier commercial models on most benchmarks. Where it wins decisively is on cost (open weights mean no licensing or per-token fees when self-hosting) and content flexibility (the RefusalBench gap is enormous).
vs. Other uncensored models: There are plenty of “uncensored” fine-tunes floating around Hugging Face, but most sacrifice significant capability for reduced restrictions. Hermes 4’s pitch is that you don’t have to choose — 96.3% on MATH-500 is not a number you typically see from models optimized primarily for minimal guardrails.
Pricing and Access
The open weights are available on Hugging Face for all three sizes (14B, 70B, 405B), plus the newer 36B Hermes 4.3. Self-hosting is free.
For API access through OpenRouter:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Hermes 4 405B | $1.00 | $3.00 | 131,072 tokens |
| Hermes 4 70B | $0.13 | $0.40 | 131,072 tokens |
The 70B pricing is particularly aggressive — $0.13 per million input tokens makes it one of the cheapest capable reasoning models available through any API provider.
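As a sanity check on those rates, per-request cost is a straight multiply-and-sum. The rates below come from the table above; the model key strings are made up for this sketch, not OpenRouter's actual model identifiers:

```python
# Per-1M-token OpenRouter rates from the table above.
RATES = {
    "hermes-4-405b": {"input": 1.00, "output": 3.00},
    "hermes-4-70b":  {"input": 0.13, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-1M-token rates."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# A 10k-token prompt with a 2k-token reply on the 70B model:
print(f"${request_cost('hermes-4-70b', 10_000, 2_000):.6f}")  # $0.002100
```

At these rates, a fairly large request on the 70B model costs a fraction of a cent, which is where the "particularly aggressive" characterization comes from.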
Community Reception: Enthusiasm Mixed with Skepticism
The launch generated substantial discussion across Hacker News, the Nous Research forums, and Hugging Face. The community reaction breaks down along predictable lines.
What people liked: The hybrid reasoning mode gets consistent praise for letting users choose when the model should think harder. The extensive technical report (94 pages) was appreciated as a signal of transparency. And for users who have felt constrained by commercial model guardrails, the RefusalBench results speak for themselves.
What drew criticism: The Hermes 4 landing page became a minor controversy on its own — the decorative WebGL animation reportedly consumed 3GB of VRAM on one user’s RTX 3090 Ti. More substantively, some commenters questioned whether a Llama 3.1-based model can genuinely compete with newer architectures. The benchmark presentation was called out for averaging unnamed competitors rather than comparing against clear state-of-the-art baselines. And the cyberpunk “operator” branding was described as “strong 14-year-old who just discovered Nietzsche energy” by at least one unimpressed commenter.
The December 2023 knowledge cutoff (inherited from Llama 3.1) is another limitation worth noting — the model’s world knowledge stops well before its release date.
FAQ
Is Hermes 4 free to use?
The model weights are free and open. You can download them from Hugging Face and run them locally without any licensing fees. API access through providers like OpenRouter has standard per-token pricing, with the 70B model available at $0.13 per million input tokens.
How does Hermes 4 compare to ChatGPT?
On mathematical reasoning, Hermes 4 405B (96.3% on MATH-500) is competitive with GPT-4o. The major difference is content policy: Hermes 4 scores 57.1% on RefusalBench compared to GPT-4o’s 17.67%, meaning it answers a much wider range of prompts. On general knowledge and coding tasks, GPT-4o still has advantages in some areas, particularly given the December 2023 knowledge cutoff.
Can I run Hermes 4 on my own hardware?
The 14B and 36B (Hermes 4.3) models can run on consumer GPUs with quantization. The 70B model requires higher-end hardware (multiple GPUs or a GPU with 48GB+ VRAM). The 405B model requires enterprise-grade multi-GPU setups or cloud instances. For most local deployment scenarios, the 36B Hermes 4.3 offers the best performance-to-hardware ratio.
What is RefusalBench?
RefusalBench is a benchmark created by Nous Research to measure how often AI models refuse to answer prompts. It tests scenarios that are commonly restricted by commercial models — creative writing involving conflict, security research, hypothetical discussions, and similar topics. Higher scores mean the model answers more prompts. The benchmark dataset is publicly available for independent verification.
Who is Nous Research?
Nous Research is an AI research organization known for producing open-weight models, including the Hermes series. They’ve been a consistent presence in the open-source AI community, with their models frequently appearing on Hugging Face leaderboards. Hermes 4 is their most ambitious release to date, accompanied by a 94-page technical report and two novel training systems (DataForge and Atropos).