AT2k Design BBS Message Area

From: VRSS
To: All
Subject: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Date/Time: June 17, 2025 9:20 AM

Feed: Slashdot
Feed Link: https://slashdot.org/
---

Title: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Link: https://slashdot.org/story/25/06/17/149238/ho...

A new benchmark assembled by a team of International Olympiad medalists
suggests the hype about large language models beating elite human coders is
premature. LiveCodeBench Pro, unveiled in a 584-problem study drawn
from Codeforces, ICPC and IOI contests, shows the best frontier model clears
just 53% of medium-difficulty tasks on its first attempt and none of the hard
ones, while grandmaster-level humans routinely solve at least some of those
highest-tier problems. The researchers measured models and humans on the same
Elo scale used by Codeforces and found that OpenAI's o4-mini-high, when
stripped of terminal tools and limited to one try per task, lands at an Elo
rating of 2,116 -- hundreds of points below the grandmaster cutoff and
roughly in the top 1.5 percent of human contestants. A granular tag-by-tag
autopsy identified implementation-friendly, knowledge-heavy problems --
segment trees, graph templates, classic dynamic programming -- as the models'
comfort zone; observation-driven puzzles such as game-theory endgames and
trick-greedy constructs remain stubborn roadblocks. Because the dataset is
harvested in real time as contests conclude, the authors argue it minimizes
training-data leakage and offers a moving target for future systems. The
broader takeaway is that impressive leaderboard jumps often reflect tool use,
multiple retries or easier benchmarks rather than genuine algorithmic
reasoning, leaving a conspicuous gap between today's models and top human
problem-solvers.

