RYS-XLarge (LLM Neuroanatomy): How Copying 7 Layers — With Zero Training — Topped the HuggingFace Leaderboard

A developer in his basement, two RTX 4090 gaming GPUs, and zero gradient descent. That’s what it took to claim the #1 spot on the HuggingFace Open LLM Leaderboard. No fine-tuning, no new data, no expensive compute cluster. David Noel Ng simply copied seven middle layers from an existing 72B model, pasted them back in, and watched the benchmarks climb.

The result — RYS-XLarge — is a 78B parameter model that outperformed everything else on five out of six evaluation benchmarks. And the technique behind it, called layer duplication or “LLM Neuroanatomy,” has since spawned the top four models on the leaderboard. As of early 2026, all of them are 78B descendants of this one weird trick that shouldn’t have worked.

What RYS-XLarge Actually Does

RYS stands for “Repeat Your Self,” which is literally what the model does. The method takes MaziyarPanahi’s calme-2.1-qwen2-72b (itself based on Qwen2-72B) and applies a single layer duplication configuration: (45, 52) — a half-open range covering layers 45 through 51.

In practice, this means:

  • Layers 0 through 51 execute normally
  • Layers 45 through 51 execute again (a second pass)
  • Then layers 52 through 79 continue as usual

That’s it. Seven layers run twice. No weights are modified. The model simply gets more iterations through its internal reasoning space. The parameter count bumps from 72B to 78B because those seven layers now exist twice in the architecture, but they’re identical copies — not new parameters trained on new data.

The key insight is what this doesn’t require: no training data, no GPU-hours of backpropagation, no RLHF, no DPO. Just inference-time architectural surgery.
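The index arithmetic is simple enough to sketch in a few lines of Python. This is an illustrative reconstruction, not Ng’s actual code; it only shows the execution order the (45, 52) configuration produces for Qwen2-72B’s 80-layer stack.

```python
def duplicated_order(num_layers: int, start: int, end: int) -> list:
    """Layer execution order after re-running the half-open range [start, end).

    No weights change: the duplicated indices refer to the same layers,
    so the forward pass simply visits them a second time.
    """
    base = list(range(num_layers))
    return base[:end] + base[start:end] + base[end:]

# Qwen2-72B has 80 decoder layers; duplicating (45, 52) yields 87 passes:
# layers 0-51 run, then 45-51 run again, then 52-79 finish as usual.
order = duplicated_order(80, 45, 52)
```

Seven extra passes on an 80-layer stack is roughly a 9% increase in depth, which lines up with the jump from 72B to 78B effective parameters.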

The Brain Scanner: How Ng Found the Right Layers

The discovery process is arguably more interesting than the result itself. Ng built what he calls a “brain scanner” for transformers — a systematic method for mapping which layers handle which cognitive tasks.

He tested 3,241 different layer duplication configurations using two carefully chosen proxy tasks:

Hard mathematics — cube roots and multiplication of enormous numbers, scored with partial credit. This probes raw numerical reasoning ability.

Emotional quotient — EQ-Bench social scenarios where the model predicts emotional state intensity on a 0-100 scale. This tests a completely different cognitive axis.

By running every possible (i, j) pair through these two orthogonal tasks, Ng generated heatmaps that function like “functional MRIs” of the transformer while it thinks. The patterns revealed something striking about how large language models organize themselves internally:

  • Early layers (0-15): Handle input encoding. Duplicating any single layer here actively hurts performance.
  • Middle layers (30-60): Contain reasoning circuits. Specific ranges within this zone showed dramatic improvements when duplicated.
  • Late layers (60+): Handle output decoding. Duplication here degrades quality.

The critical finding: single-layer duplication almost universally fails. Too few layers — nothing happens. Too many — performance drops. Only circuit-sized blocks of roughly 7 layers produce gains. Ng argues this means transformer middle layers aren’t doing independent iterative refinement. They function as “indivisible multi-step reasoning pipelines” — coherent units that perform complete cognitive operations.

All of this discovery work ran on two RTX 4090s. No H100 cluster needed.
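The scan itself can be sketched as a grid search: enumerate every half-open (i, j) range, score each configuration on the two proxy tasks, and collect the results for a heatmap. This is a hedged reconstruction — `score` is a stand-in for the actual hard-math and EQ-Bench evaluations, which the article doesn’t detail.

```python
def all_ranges(num_layers: int) -> list:
    """Every half-open duplication range (i, j) with 0 <= i < j <= num_layers."""
    return [(i, j) for i in range(num_layers)
                   for j in range(i + 1, num_layers + 1)]

def scan(num_layers: int, score) -> dict:
    """Map each (i, j) config to its proxy-task score, ready for heatmap plotting.

    `score(i, j)` is assumed to build the duplicated model and evaluate it
    on the two proxy tasks described above.
    """
    return {cfg: score(*cfg) for cfg in all_ranges(num_layers)}
```

For an 80-layer model this grid has 80 × 81 / 2 = 3,240 entries — close to the 3,241 configurations reported, though the exact grid Ng used (e.g. whether it included a baseline run) isn’t specified.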

The Numbers: +17.72% on MuSR, Five of Six Benchmarks Up

Here’s what happened when RYS-XLarge hit the Open LLM Leaderboard’s six evaluation benchmarks:

Benchmark             Score   Change
IFEval (0-Shot)       79.96   -2.05%
BBH (3-Shot)          58.77   +2.51%
MATH Lvl 5 (4-Shot)   38.97   +8.16%
GPQA (0-Shot)         17.90   +2.58%
MuSR (0-Shot)         23.72   +17.72%
MMLU-PRO (5-Shot)     49.20   +0.31%
Average               44.75   +2.61%

Five out of six benchmarks improved, with only IFEval (instruction following) taking a small hit. The MuSR result — a 17.72% jump on multi-step reasoning — stands out. The MATH improvement of 8.16% is significant too. These are exactly the reasoning-heavy benchmarks where extra “thinking layers” should help, and they did.

The average score of 44.75 placed RYS-XLarge at #1 on the leaderboard at the time of submission.
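As a quick sanity check, the reported average is reproducible from the six scores in the table:

```python
scores = {
    "IFEval (0-Shot)": 79.96,
    "BBH (3-Shot)": 58.77,
    "MATH Lvl 5 (4-Shot)": 38.97,
    "GPQA (0-Shot)": 17.90,
    "MuSR (0-Shot)": 23.72,
    "MMLU-PRO (5-Shot)": 49.20,
}
average = sum(scores.values()) / len(scores)  # 44.7533..., rounds to 44.75
```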

How It Compares: SOLAR, LoopLM, and the Curse of Depth

RYS-XLarge isn’t the first attempt at layer manipulation in LLMs, but it’s different from prior approaches in important ways.

SOLAR 10.7B (Upstage, 2024) used “depth up-scaling” — duplicating the entire model and removing some layers from each copy before concatenating them. But SOLAR required continued pretraining after the architectural change to recover performance. RYS-XLarge needs none.

LoopLM and Looped Transformers explore repeated computation through learned halting policies, where models dynamically decide how many times to loop through layers. This is architecturally more elegant but requires training from scratch with the looping mechanism built in.

“The Curse of Depth” (2025) is a paper that found the second half of transformer layers in Pre-LayerNorm architectures contributes far less to the output than the first half. This actually supports Ng’s findings: the middle layers matter most, and giving them extra compute cycles via duplication is one way to counteract depth inefficiency.

What makes RYS unique among these approaches is its simplicity and zero-cost nature. There’s no training pipeline to set up, no hyperparameters to tune, no data to curate. You pick a layer range, duplicate it, and evaluate. The entire method is orthogonal to fine-tuning — meaning you can apply RYS and then fine-tune on top, which is exactly what the community did.

The Leaderboard Takeover: All Top Four Are RYS Descendants

The open-source community didn’t just acknowledge RYS-XLarge — they built on it. Within months:

  • MaziyarPanahi fine-tuned RYS-XLarge to create calme-2.4-rys-78b
  • dfurman applied ORPO training to produce CalmeRys-78B-Orpo-v0.1
  • Other derivatives followed, each layering additional optimization on top of the duplicated architecture

By early 2026, the top four models on the Open LLM Leaderboard were all 78B parameter models — all descendants of RYS-XLarge, with scores ranging from 50.77 to 52.08. The base technique proved fully compatible with fine-tuning, ORPO, and other standard post-training methods.

On Hacker News, the Show HN post pulled 461 points and over 120 comments. The community reaction ranged from “this is research-grade effort, publish it at NeurIPS” to practical discussions about replicating the approach on other architectures. At least one commenter reported successfully replicating the method on a different model, getting a 23.5% improvement on reasoning tasks by duplicating layers 48-53 twice.

The model itself has 90 likes on HuggingFace, with quantized versions available for llama.cpp, LM Studio, and Ollama. It’s MIT licensed.

What This Means for the Future of Open-Source LLMs

The implications go beyond one leaderboard position. If large language models develop organized “functional anatomy” during training — with identifiable regions handling reasoning, encoding, and decoding — then layer duplication is just the beginning.

Community discussions have floated several ideas: variable-depth inference where easy prompts skip the duplication loop while hard ones use it; pluggable “knowledge banks” that swap specialized layer blocks; and standardized layer libraries for dynamic model composition.

There’s a practical cost consideration too. Layer duplication increases inference compute (those seven layers run twice) and expands the KV cache, but it doesn’t require additional VRAM for weights, since the duplicated layers are pointers to the same tensors in memory. For many deployment scenarios, that’s a worthwhile trade.
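A toy illustration of why the weight footprint stays flat: if the duplicated entries in the layer list are references to the same objects, only the activations and KV cache grow. The `Layer` class here is a hypothetical stand-in for a real decoder block.

```python
class Layer:
    """Stand-in for a decoder block; `weights` plays the role of its tensors."""
    def __init__(self, size: int):
        self.weights = [0.0] * size

stack = [Layer(4) for _ in range(10)]        # a 10-layer toy model
deeper = stack[:7] + stack[3:7] + stack[7:]  # duplicate layers 3..6

# 14 entries in the execution order, but still only 10 distinct weight sets:
# the four "new" entries are aliases, so no extra memory for weights.
distinct = len({id(layer) for layer in deeper})
```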

The deeper question is whether these findings generalize beyond Qwen2. Ng’s ongoing work suggests they do — his heatmaps across newer architectures from different model families keep showing the same general patterns, though “every architecture has its own neuroanatomy.” Qwen, MiniMax, GLM, and others all show identifiable reasoning circuits in their middle layers, just in different positions.

Frequently Asked Questions

How much does RYS-XLarge cost to create?
Essentially nothing beyond the base model download and inference compute. The entire discovery process ran on two consumer RTX 4090 GPUs (roughly $3,200 total retail). No training, fine-tuning, or cloud compute was needed. The layer duplication itself is a configuration change, not a compute-intensive process.

Does layer duplication work on any LLM?
The general principle — that middle-layer duplication can improve reasoning — appears to hold across architectures, but the optimal layer range varies per model. You need to run the heatmap analysis (or experiment with different ranges) for each architecture. Ng’s work shows Qwen, MiniMax, and GLM models all have exploitable reasoning circuits, but the specific (i, j) configuration differs.

How does RYS-XLarge compare to Qwen2-72B directly?
RYS-XLarge is built on a fine-tuned variant of Qwen2-72B (calme-2.1-qwen2-72b), not the raw base model. The layer duplication adds 6B parameters (72B to 78B) and improves the average benchmark score by 2.61%, with the biggest gains in multi-step reasoning (+17.72% on MuSR) and math (+8.16% on MATH Lvl 5).

What are the downsides of layer duplication?
Inference is slower because seven layers execute twice, and the KV cache grows proportionally. The IFEval benchmark (instruction following) dropped by 2.05%, suggesting the extra reasoning layers can slightly hurt the model’s ability to follow formatting instructions precisely. The technique also doesn’t add new knowledge — it only amplifies existing reasoning capacity.

Can I run RYS-XLarge locally?
Yes. Quantized versions (GGUF format) are available for llama.cpp, LM Studio, Jan, and Ollama. The full-precision model requires significant VRAM (multiple high-end GPUs), but 4-bit quantized versions can run on consumer hardware. The model is released under the MIT license.

