LaRA targets a fast-growing eval-integrity problem: as models go through reinforcement learning post-training, benchmark questions can quietly leak into the training data, so the model passes the test by remembering rather than reasoning. LaRA, short for Layer-wise Representation Analysis, is the proposed detector.
## Looking inside, not outside
Most contamination detection works from the outside, comparing a model’s answers to suspected sources. LaRA looks inside, examining how the model’s internal states evolve across layers during and after RL post-training. The premise is that contamination leaves a fingerprint there: representations shift through the layers differently when a model has seen an evaluation example than when it hasn’t. That signal is harder to scrub than the surface overlap a string match looks for.
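To make the idea of a layer-wise shift concrete, here is a minimal sketch of one way to compute a per-layer profile for a single input. It is not LaRA’s actual metric, which this summary does not specify. It assumes you already have one hidden-state vector per layer from the checkpoint before RL post-training and one from the checkpoint after, and it uses cosine distance as a stand-in measure. The function names and the toy vectors are hypothetical placeholders.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def layer_shift_profile(before, after):
    """Per-layer cosine distance between two stacks of hidden states.

    before, after: one vector per layer for the same input, taken from
    the checkpoint before and after RL post-training (an assumed setup).
    """
    if len(before) != len(after):
        raise ValueError("both stacks must have the same number of layers")
    return [cosine_distance(b, a) for b, a in zip(before, after)]


# Toy placeholder vectors (3 layers, 2 dimensions), not real model states.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
after = [[1.0, 0.0], [1.0, 0.0], [2.0, 2.0]]

profile = layer_shift_profile(before, after)
for layer, shift in enumerate(profile):
    print(f"layer {layer}: shift = {shift:.3f}")
```

A detector in this spirit would compare such profiles between benchmark items and items known to be unseen. How the profiles are compared, and what threshold counts as contamination, is left open here.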
## Why it matters
RL post-training has become standard for frontier models, and benchmark contamination is the unflattering question many strong scores can’t escape. Internal-state analysis is the cleaner answer: it works on the model you actually shipped, not on training data you don’t have access to. As leaderboard pressure pushes labs to scrape harder, a defensible way to ask “did the model see this benchmark during RL?” is the trust infrastructure the field needs, both for buyers comparing models and for researchers trying to keep evaluation honest. The work comes from Yonsei, Seoul National University, and Georgia Tech.
