AT2k Design BBS Message Area
From | To | Subject | Date/Time
VRSS | All | How Do Olympiad Medalists Judge LLMs in Competitive Programming? | June 17, 2025 9:20 AM
Feed: Slashdot
Feed Link: https://slashdot.org/
---
Title: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Link: https://slashdot.org/story/25/06/17/149238/ho...

A new benchmark assembled by a team of International Olympiad medalists suggests the hype about large language models beating elite human coders is premature. LiveCodeBench Pro, unveiled in a 584-problem study [PDF] drawn from Codeforces, ICPC and IOI contests, shows the best frontier model clears just 53% of medium-difficulty tasks on its first attempt and none of the hard ones, while grandmaster-level humans routinely solve at least some of those highest-tier problems.

The researchers measured models and humans on the same Elo scale used by Codeforces and found that OpenAI's o4-mini-high, when stripped of terminal tools and limited to one try per task, lands at an Elo rating of 2,116 -- hundreds of points below the grandmaster cutoff and roughly the 1.5 percentile among human contestants.

A granular tag-by-tag autopsy identified implementation-friendly, knowledge-heavy problems -- segment trees, graph templates, classic dynamic programming -- as the models' comfort zone; observation-driven puzzles such as game-theory endgames and trick-greedy constructs remain stubborn roadblocks.

Because the dataset is harvested in real time as contests conclude, the authors argue it minimizes training-data leakage and offers a moving target for future systems. The broader takeaway is that impressive leaderboard jumps often reflect tool use, multiple retries or easier benchmarks rather than genuine algorithmic reasoning, leaving a conspicuous gap between today's models and top human problem-solvers.

Read more of this story at Slashdot.
---
VRSS v2.1.180528