Show Notes
I ran a hands-on benchmark by rebuilding Flappy Bird in Python with Pygame and feeding the same prompt to a lineup of AI models. The goal was to see which model delivers the most usable, responsive game loop, and Grok 3 turns out to be the surprise winner.
Setup and approach
- Same prompt used across all models to keep the comparison fair.
- Each model got its own folder containing a small, self-contained Python script (main.py) that uses Pygame.
- The workflow: open multiple terminals, cd into each model's folder (o1, o3-mini, o4-mini, and so on), and run python main.py to test.
- Observations often involved rendering quirks (bird shapes: square, circle, triangle) and control responsiveness (gravity, speed, and spacebar behavior).
- Tools noted:
- VS Code for editing and quick syntax fixes
- Copilot used live to fix syntax issues when needed
- Pygame as the game engine required by the prompt
Example prompt approach (paraphrased):
- A fixed prompt was used for all models, asking for a self-contained Flappy Bird implementation in Pygame with gravity, collision detection, and a responsive spacebar flap (a minimal sketch of what that looks like follows below).
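For orientation, here is a minimal sketch of the kind of self-contained main.py the prompt asks for. The constants, the rectangular "bird", and the reset-on-collision behavior are my own illustrative choices, not output from any of the tested models.

```python
# A minimal sketch of a self-contained Flappy Bird loop in Pygame.
# Constants, the rectangular "bird", and the reset-on-collision behavior
# are illustrative assumptions, not output from any of the tested models.
import random
import sys

import pygame

WIDTH, HEIGHT = 400, 600
GRAVITY = 0.4          # downward acceleration, pixels/frame^2
FLAP_VELOCITY = -7.5   # upward kick applied on spacebar
PIPE_SPEED = 3         # horizontal scroll speed, pixels/frame
PIPE_GAP = 160         # vertical gap between pipe pairs
FPS = 60


def main():
    pygame.init()
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    clock = pygame.time.Clock()

    bird = pygame.Rect(80, HEIGHT // 2, 30, 30)
    bird_y = float(bird.y)   # track vertical position as a float for smooth physics
    velocity = 0.0
    pipes = []               # list of (top_rect, bottom_rect) pairs
    spawn_timer = 0

    while True:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()
            if event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE:
                velocity = FLAP_VELOCITY   # flap: instant upward velocity

        # Physics: integrate gravity, then sync the draw rect to the float position.
        velocity += GRAVITY
        bird_y += velocity
        bird.y = int(bird_y)

        # Spawn a new pipe pair every 90 frames (~1.5 s at 60 FPS).
        spawn_timer += 1
        if spawn_timer >= 90:
            spawn_timer = 0
            gap_top = random.randint(80, HEIGHT - 80 - PIPE_GAP)
            pipes.append((
                pygame.Rect(WIDTH, 0, 60, gap_top),
                pygame.Rect(WIDTH, gap_top + PIPE_GAP, 60, HEIGHT),
            ))

        # Scroll pipes left and drop the ones that have left the screen.
        pipes = [(t.move(-PIPE_SPEED, 0), b.move(-PIPE_SPEED, 0)) for t, b in pipes]
        pipes = [(t, b) for t, b in pipes if t.right > 0]

        # Collision with a pipe or the screen edges resets the run.
        hit = bird.top < 0 or bird.bottom > HEIGHT or any(
            bird.colliderect(t) or bird.colliderect(b) for t, b in pipes
        )
        if hit:
            bird_y, velocity, pipes, spawn_timer = HEIGHT / 2, 0.0, [], 0
            bird.y = int(bird_y)

        # Draw the frame.
        screen.fill((30, 30, 40))
        pygame.draw.rect(screen, (240, 200, 60), bird)
        for t, b in pipes:
            pygame.draw.rect(screen, (80, 200, 120), t)
            pygame.draw.rect(screen, (80, 200, 120), b)
        pygame.display.flip()
        clock.tick(FPS)


if __name__ == "__main__":
    main()
```

Even a skeleton this small exercises the things the verdicts below focus on: spacebar responsiveness, gravity feel, and whether the loop stays stable.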
Models tested
- OpenAI family
- o1
- o3-mini
- o3-mini-high
- o4-mini
- o4-mini-high
- OpenAI's latest model (as of the test period)
- Anthropic (Claude Sonnet) family
- Claude 3.7 Sonnet
- Claude 3.5 Sonnet
- EnAnthropic (Anthropic-style prompt)
- Google Gemini family
- Gemini 2.0 Flash
- Gemini 2.5
- xAI Grok family
- Grok 3 (the sleeper hit)
Observations and quick verdicts
- Gemini 2.0 Flash
- Verdict: rough start, slow and unimpressive
- o1
- Verdict: poor baseline
- o3-mini
- Verdict: underwhelming behavior, gravity and control feel off
- o3-mini-high
- Verdict: notably better; one of the better performers in the run
- Claude 3.7 Sonnet
- Verdict: strong potential, good thinking/adjustments; bugged reset at one point
- Claude 3.5 Sonnet
- Verdict: hit-or-miss; some syntax/implementation issues
- EnAnthropic
- Verdict: mixed results; reliability varied across runs
- Gemini 2.5
- Verdict: mixed-to-average outcomes; not the standout
- OpenAI latest
- Verdict: included in the mix, but not clearly the best in this batch
- o4-mini
- Verdict: fast start, but encountered stability/syntax problems; not consistently reliable
- o4-mini-high
- Verdict: one of the faster, more responsive attempts; still some quirks
- Grok 3
- Verdict: the clear winner in this test; finally delivered a reliable, playable result
- Noted as the “line 129” moment where the implementation aligned cleanly with the prompt
- Final assessment: Grok 3 beat the others for playable performance and stability
The winner and what it means
- Grok 3 emerged as the winner of this Flappy Bird benchmark.
- Why it stood out:
- More stable gravity and collision behavior
- More reliable spacebar responsiveness
- Fewer quirky rendering anomalies compared to other models
- Takeaway: even with a single shared prompt and the same task, model performance can be all over the map. The best result often comes from a model that handles the control flow and timing more predictably, not just raw language quality.
Notable moments and gotchas
- The experiment wasn’t just about accuracy; it was about “playability” and responsiveness in a simple game loop. Some models produced nice text but struggled to drive a real-time Pygame window smoothly.
- Syntax hiccups and code integration issues were common (hence the Copilot mentions). Having a quick fix path can save lots of time but doesn’t change the underlying model performance.
- Visual quirks (bird shapes, velocities) were a reminder that rendering decisions can skew perceived performance even when the logic is correct.
How you can reproduce
- Use a fixed prompt across all models to ensure comparability.
- Create a separate folder per model and place a minimal Pygame-based main.py in each.
- Run python main.py from each model’s folder (or use a small runner script like the one sketched after this list) and observe:
- Responsiveness to spacebar
- Consistency of gravity and bird movement
- Stability of the game loop (no freezes or crashes)
- Grade each model by how playable the result is, not just whether it runs.
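To avoid juggling half a dozen terminals, a small runner script can launch each folder’s main.py in turn. This is a convenience sketch, not part of the original workflow, and the folder names in MODEL_DIRS are assumptions; rename them to match your own layout.

```python
# Convenience runner: launch each model folder's main.py one at a time
# for a hands-on play test. Folder names are placeholder assumptions;
# adjust them to match however you name your per-model directories.
import subprocess
import sys
from pathlib import Path

MODEL_DIRS = ["o1", "o3-mini", "o3-mini-high", "o4-mini", "grok-3"]  # assumed names

for name in MODEL_DIRS:
    folder = Path(name)
    script = folder / "main.py"
    if not script.exists():
        print(f"[skip] {script} not found")
        continue
    print(f"[run] {name}: close the game window to move on to the next model")
    # Run with the folder as the working directory so relative assets resolve.
    result = subprocess.run([sys.executable, "main.py"], cwd=folder)
    print(f"[done] {name}: exit code {result.returncode}")
```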
Actionable takeaways
- When benchmarking AI models on code tasks, prioritize runtime behavior and user experience (e.g., input responsiveness and frame rate stability; a jitter-measurement sketch follows this list) in addition to correctness.
- A single “best” model can be elusive; you may need to test multiple generations within a provider’s lineup to find a playable option.
- Don’t rely on a model’s language quality alone—test its ability to produce real-time, deterministic results in an executable environment.
- Having quick tooling fixes (e.g., Copilot or editor assist) is helpful, but keep the core tests model-centric rather than tool-centric.
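If you want a number for “frame rate stability” rather than an eyeball judgment, one option (my own addition, not part of the original runs) is to log per-frame deltas from Pygame’s clock and report the jitter at the end of a run:

```python
# Rough frame-time stability check: record how long each frame took and
# report the jitter (standard deviation) at the end of a run. Hook it into
# whichever model's game loop you are grading; this helper is a convenience
# I am assuming, not something from the original test setup.
import statistics


class FrameStats:
    def __init__(self):
        self.frame_times_ms = []

    def record(self, clock):
        """Call once per frame, after clock.tick(); pygame's Clock.get_time()
        returns the milliseconds spent on the previous frame."""
        self.frame_times_ms.append(clock.get_time())

    def report(self):
        if len(self.frame_times_ms) < 2:
            print("not enough frames recorded")
            return
        mean = statistics.mean(self.frame_times_ms)
        jitter = statistics.stdev(self.frame_times_ms)
        print(f"frames: {len(self.frame_times_ms)}  "
              f"mean: {mean:.1f} ms  jitter: {jitter:.1f} ms")


# Usage inside a Pygame loop:
#   stats = FrameStats()
#   ... each frame: clock.tick(FPS); stats.record(clock)
#   ... on exit: stats.report()
```

A model whose loop shows low mean frame time and low jitter will usually also feel the most responsive on the spacebar, which is exactly what separated the better runs in this batch.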
Links
- Pygame: https://www.pygame.org
- Visual Studio Code: https://code.visualstudio.com
- GitHub Copilot: https://github.com/features/copilot
- Google Gemini: https://ai.google/discover/gemini
- Anthropic Claude: https://www.anthropic.com/claude
If you want more battle-tested AI benchmarks like this, I’ll run additional rounds and share the breakdowns with a focus on real-time performance and robustness.