Who's Cheating in AI Tests?

Your Daily Dose of AI Goodness


Featured Story

War of the benchmarks

The TLDR
The war of the benchmarks rages after Grok-3's release, with companies trading fraud accusations on X. Noam Brown suggests judging models by cost-effectiveness instead. Sam Altman's free GPT-5 strategy suggests that distribution can trump performance, drawing on marketing experience from his Y Combinator days.

Just hours after the release of xAI's Grok-3, a heated battle over which LLM is best erupted on the social media platform X, with companies accusing each other of fraud and dismissing each other's results. The dispute highlights a growing problem: traditional benchmarks are increasingly inadequate for evaluating models.

Shortly after, Noam Brown, one of the researchers behind OpenAI's reasoning model o1, offered a solution. Instead of constantly creating new benchmarks for various metrics, he suggested evaluating new models by the performance they deliver relative to what they cost to run. This is a smart approach, as cost is a crucial factor that is often overlooked.
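For readers who want to see what a cost-aware comparison could look like, here is a minimal Python sketch. It is not Brown's method, just one plausible reading of the idea: rank models by score per dollar and keep only the models that no other model beats on both score and cost. The model names, scores, and costs are hypothetical placeholders, not real results.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    score: float          # benchmark score in percent (hypothetical)
    cost_per_task: float  # USD per task (hypothetical)


def score_per_dollar(result: ModelResult) -> float:
    """Benchmark points obtained per dollar spent on one task."""
    if result.cost_per_task <= 0:
        raise ValueError("cost_per_task must be positive")
    return result.score / result.cost_per_task


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Keep models that no other model matches or beats on both axes."""
    frontier = []
    for r in results:
        dominated = any(
            o.score >= r.score
            and o.cost_per_task <= r.cost_per_task
            and (o.score > r.score or o.cost_per_task < r.cost_per_task)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, so the list reads as "what more money buys you".
    return sorted(frontier, key=lambda r: r.cost_per_task)


# Hypothetical placeholder numbers, for illustration only.
results = [
    ModelResult("model-a", score=30.0, cost_per_task=0.50),
    ModelResult("model-b", score=28.0, cost_per_task=0.25),
    ModelResult("model-c", score=25.0, cost_per_task=0.40),
]

for r in pareto_frontier(results):
    print(r.name, r.score, r.cost_per_task, score_per_dollar(r))
```

In this made-up example, model-c drops out because model-b scores higher and costs less, while model-a stays in because it still has the top score. A single leaderboard number would hide that trade-off.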

Lower costs allow for wider distribution, boosting the company's brand while encouraging more people, including those with little or no prior exposure, to try out AI models. This is exactly why GPT-5 will be permanently available in the free tier, a brilliant strategic move by Sam Altman.

In the end, success isn’t just about having the best model but about achieving the widest distribution through effective PR. Despite fierce competition, OpenAI stands out with Sam Altman’s exceptional marketing skills. It’s no coincidence that he was President of Y Combinator before becoming CEO of OpenAI.

Today’s Sponsor

We only support advertisers we believe in and use. To keep the newsletter free, please consider checking out our sponsors by clicking below (only if you think it will be useful). Thanks!

Gamma is maybe our favorite tool for quickly generating a website or PowerPoint deck with AI. Super easy. Give it a look…

The future of presentations, powered by AI

Gamma is a modern alternative to slides, powered by AI. Create beautiful and engaging presentations in minutes. Try it free today.

 

In the News

Flying Car Production Set for 2026

Alef Aeronautics successfully tests its Model A flying car that drives and takes off vertically. The company targets production by early 2026.

 

Claude's Extended Thinking Matches o3-mini

Claude 3.7 Sonnet with 16K thinking tokens scores 28.6%, matching OpenAI's o3-mini. Longer thinking significantly improves results, though at higher cost.

 

OpenAI Expands Deep Research Access

OpenAI extends Deep Research access to additional ChatGPT tiers, including Plus and Education. Users on those tiers receive 10 queries per month, versus 120 for Pro.

 
