What success rate indicates strong performance on the CL-bench contextual learning tasks?

Given that CL-bench is designed to test learning novel information from context, what constitutes 'good' performance? I saw a reference to GPT-5.1 (high) scoring only 23.7%; does this represent the current state-of-the-art low, and what would be considered a strong result on this benchmark?

Best Answer

Admin

2026-02-03

The reported performance metrics from the initial release of CL-bench suggest that achieving high success rates on tasks requiring strict contextual learning is currently very challenging for even the most advanced models.

Interpreting Current Results

The figure cited, GPT-5.1 (high) achieving only a 23.7% success rate, serves as a critical data point. It illustrates the severe gap between a model’s general capabilities (e.g., solving complex math problems) and its ability to perform genuine contextual learning within a prompt. The 23.7% likely reflects how low current state-of-the-art performance still is: models that rely heavily on static knowledge struggle when they must learn new material from the context window alone, with no prior exposure to it.

Defining Strong Performance

Since the benchmark’s core goal is to force models to learn entirely new information that was not available during pre-training, a 'strong' result would theoretically be significantly higher than today's scores, ideally approaching human-level performance on similar tasks. The research team's objective is to fundamentally change how Large Language Models are optimized. Therefore, any model demonstrating a success rate well above the 23.7% reference point would signal a meaningful architectural or training breakthrough in genuine in-context adaptation rather than mere retrieval of parametric knowledge.
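One way to make "significantly higher" concrete is to check whether a new model's success rate stays above 23.7% once sampling noise is accounted for. The following is a minimal sketch, not part of CL-bench itself: the function names and the task counts in the usage lines are hypothetical, the number of tasks in the benchmark is not stated in this thread, and the 23.7% figure is treated as a fixed reference rather than as an estimate with its own uncertainty.

```python
import math

# Reference success rate quoted in the question for GPT-5.1 (high).
BASELINE = 0.237


def wilson_interval(successes, total, z=1.96):
    """Return an approximate 95% Wilson score interval for a success rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def clearly_above_baseline(successes, total, baseline=BASELINE):
    """True if the lower bound of the interval exceeds the reference rate."""
    low, _ = wilson_interval(successes, total)
    return low > baseline


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the check.
    for solved, attempted in [(300, 1000), (25, 100)]:
        low, high = wilson_interval(solved, attempted)
        verdict = clearly_above_baseline(solved, attempted)
        print(f"{solved}/{attempted}: [{low:.3f}, {high:.3f}] above baseline: {verdict}")
```

The point of the sketch is that a headline rate a few points above 23.7% on a small task set may not be distinguishable from the reference, while the same gap on a large task set can be.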

Asked by: User Asked: 2026-02-03 Answered: 2026-02-03
