What success rate indicates strong performance on the CL-bench contextual learning tasks?

Given that CL-bench is designed to test learning novel information from context, what constitutes 'good' performance? I saw a reference to GPT-5.1 (high) scoring only 23.7%; does this represent the current state-of-the-art low, and what would be considered a strong result on this benchmark?

Best Answer
Admin
2026-02-03

The performance metrics reported in the initial release of CL-bench suggest that achieving high success rates on tasks requiring strict contextual learning is currently very challenging, even for the most advanced models.

Interpreting Current Results

The figure cited, GPT-5.1 (high) achieving only a 23.7% success rate, is a critical data point. It illustrates the severe gap between a model's general capabilities (e.g., solving complex math problems) and its ability to perform genuine contextual learning within a prompt. If, as reported, this is the top score among evaluated models, then the answer to your question is yes: 23.7% is the current state of the art, and that state of the art is simply low. Models that lean heavily on static, parameterized knowledge falter when a task can only be solved by learning from the context window itself.
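
For context, a success rate on a benchmark like this is typically just the fraction of tasks whose responses pass a grader. Below is a minimal sketch of that computation; CL-bench's actual evaluation harness is not described in this thread, so the `Task` structure, grading function, and model interface are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical task schema: CL-bench's real format is not shown in this
# thread, so these fields are assumptions for illustration only.
@dataclass
class Task:
    context: str                   # novel material the model must learn from
    question: str                  # query answerable only via that context
    grade: Callable[[str], bool]   # pass/fail grader for a model response

def success_rate(tasks: list[Task], model: Callable[[str], str]) -> float:
    """Fraction of tasks whose model answer passes the grader."""
    passed = sum(
        task.grade(model(f"{task.context}\n\nQuestion: {task.question}"))
        for task in tasks
    )
    return passed / len(tasks)

# Example: 237 passes out of 1000 tasks yields 0.237, i.e. the 23.7% cited above.
```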

Defining Strong Performance

Since the benchmark's core goal is to force models to learn entirely new information that is absent from their pre-training data, a 'strong' result would be substantially higher, ideally approaching human-level performance on comparable tasks. The research team's stated objective is to fundamentally change how large language models are optimized. Any model scoring well above the 23.7% baseline would therefore signal a meaningful architectural or training breakthrough in genuine in-context adaptation, as opposed to mere retrieval of parameterized knowledge.
