Easy for humans, hard for AI // challenges and benchmarks

sbagency

--

There are dozens of AI benchmarks // what’s really hard or impossible for AI?

Multiplayer games are a great benchmark for AI: one model versus another.

https://arcprize.org/blog/snakebench

- SnakeBench: An experimental challenge pitting LLMs against each other in 2,800 head-to-head snake games, assessing how they handle dynamic environments, real-time decision-making, and long-term strategy.

- Top Performers: o3-mini and DeepSeek excelled, winning 78% of their matches.

- Key Challenges: LLMs struggle with spatial reasoning (the board reaches them only as text), with long-horizon planning, and are brittle to novelty or rule changes; a minimal sketch of this text-board setup follows the list.

- Benchmark Design Insights: Context-heavy prompts improve performance, and reasoning models do better because they deliberate internally before committing to a move (a prompt-construction sketch follows the links below).

- Future Work: Plans include increasing board complexity, adding obstacles, and refining prompting techniques to reduce errors.
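To make the setup concrete, here is a minimal, hypothetical sketch of a SnakeBench-style arena: two snakes on a text-rendered grid, with a stubbed `choose_move` standing in for the LLM call. The grid size, symbols, and rules here are assumptions for illustration; the real harness and prompt format are described at snakebench.com.

```python
import random

W, H = 8, 8
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def render(snakes, apple):
    # Serialize the board to plain text -- the "textual format" the
    # post flags as a spatial-reasoning bottleneck for LLMs.
    grid = [["." for _ in range(W)] for _ in range(H)]
    grid[apple[1]][apple[0]] = "*"
    for name, body in snakes.items():
        for i, (x, y) in enumerate(body):
            grid[y][x] = name.upper() if i == 0 else name.lower()
    return "\n".join("".join(row) for row in grid)

def choose_move(board_text, name):
    # Stand-in for an LLM call: in the real benchmark the rendered
    # board plus the rules would go into the model's prompt, and its
    # reply would be parsed into one of the four moves.
    return random.choice(list(MOVES))

def play(max_turns=100):
    snakes = {"a": [(1, 1)], "b": [(W - 2, H - 2)]}
    apple = (W // 2, H // 2)
    for turn in range(max_turns):
        for name in ("a", "b"):
            dx, dy = MOVES[choose_move(render(snakes, apple), name)]
            hx, hy = snakes[name][0]
            head = (hx + dx, hy + dy)
            occupied = {c for body in snakes.values() for c in body}
            # Hitting a wall or any snake segment ends the match.
            if not (0 <= head[0] < W and 0 <= head[1] < H) or head in occupied:
                return f"snake {name} crashed on turn {turn}"
            snakes[name].insert(0, head)
            if head == apple:   # eat: grow, respawn the apple
                apple = (random.randrange(W), random.randrange(H))
            else:
                snakes[name].pop()  # no food: tail moves forward
    return "draw (turn limit reached)"

print(play())
```

Swapping `choose_move` for a real model call turns this into a head-to-head match; the interesting failures show up exactly where the post says, in parsing the ASCII grid back into spatial relationships.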

https://snakebench.com/findings
https://x.com/karpathy/status/1885740680804504010
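Below is a hypothetical sketch of what "context-heavy" could mean in practice, assuming the prompt bundles rules, explicit coordinates, and recent move history alongside the rendered board. `build_prompt` and its fields are illustrative, not SnakeBench's actual format.

```python
# Hypothetical "context-heavy" prompt builder. The idea from the
# findings: spelling out rules, coordinates, and history gives the
# model more to work with than a bare board dump.
def build_prompt(board_text, you, opponent, apple, history):
    rules = (
        "You are playing Snake on an 8x8 grid against one opponent.\n"
        "Hitting a wall, yourself, or the opponent loses the game.\n"
        "Eating the apple (*) makes you grow.\n"
    )
    state = (
        f"Board (A/B = heads, a/b = bodies, * = apple):\n{board_text}\n"
        f"Your head: {you}  Opponent head: {opponent}  Apple: {apple}\n"
        f"Your last moves: {', '.join(history[-5:]) or 'none'}\n"
    )
    ask = "Reason step by step, then end with MOVE: up, down, left, or right."
    return rules + state + ask

board = "........\n..*.....\n.A......\n......B.\n" + "........\n" * 4
print(build_prompt(board, you=(1, 2), opponent=(6, 3),
                   apple=(2, 1), history=["up", "up"]))
```

The closing instruction to reason before answering mirrors the finding that models which deliberate internally before output win more games.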
