Show Notes
I ran a hands-on benchmark by rebuilding Flappy Bird in Python with Pygame and feeding the same prompt to a lineup of AI models. The goal was to see which model delivers the most usable, responsive game loop, and Grok 3 turns out to be the surprise winner.
Setup and approach
- Same prompt used across all models to keep the comparison fair.
- Each model got its own folder containing a small, self-contained Python script (main.py) that uses Pygame.
- The workflow: open multiple terminals, cd into each model's folder (o1, o3-mini, o4-mini, and so on), and run python main.py to test.
- Observations often involved rendering quirks (bird shapes: square, circle, triangle) and control responsiveness (gravity, speed, and spacebar behavior).
- Tools noted:
- VS Code for editing and quick syntax fixes
- Copilot used live to fix syntax issues when needed
- Pygame as the game engine required by the prompt
Example prompt approach (paraphrased):
- A fixed prompt was used for all models, asking for a self-contained Flappy Bird implementation in Pygame with gravity, collision detection, and a responsive spacebar flap (a minimal sketch of what that looks like follows below).
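For orientation, here is a minimal sketch of the kind of self-contained main.py the prompt asks for. The constants, the rectangular "bird", and the reset-on-collision behavior are my own illustrative choices, not output from any of the tested models.

```python
# A minimal sketch of a self-contained Flappy Bird loop in Pygame.
# Constants, the rectangular "bird", and the reset-on-collision behavior
# are illustrative assumptions, not output from any of the tested models.
import random
import sys

import pygame

WIDTH, HEIGHT = 400, 600
GRAVITY = 0.4          # downward acceleration, pixels/frame^2
FLAP_VELOCITY = -7.5   # upward kick applied on spacebar
PIPE_SPEED = 3         # horizontal scroll speed, pixels/frame
PIPE_GAP = 160         # vertical gap between pipe pairs
FPS = 60


def main():
    pygame.init()
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    clock = pygame.time.Clock()

    bird = pygame.Rect(80, HEIGHT // 2, 30, 30)
    bird_y = float(bird.y)   # track vertical position as a float for smooth physics
    velocity = 0.0
    pipes = []               # list of (top_rect, bottom_rect) pairs
    spawn_timer = 0

    while True:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()
            if event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE:
                velocity = FLAP_VELOCITY   # flap: instant upward velocity

        # Physics: integrate gravity, then sync the draw rect to the float position.
        velocity += GRAVITY
        bird_y += velocity
        bird.y = int(bird_y)

        # Spawn a new pipe pair every 90 frames (~1.5 s at 60 FPS).
        spawn_timer += 1
        if spawn_timer >= 90:
            spawn_timer = 0
            gap_top = random.randint(80, HEIGHT - 80 - PIPE_GAP)
            pipes.append((
                pygame.Rect(WIDTH, 0, 60, gap_top),
                pygame.Rect(WIDTH, gap_top + PIPE_GAP, 60, HEIGHT),
            ))

        # Scroll pipes left and drop the ones that have left the screen.
        pipes = [(t.move(-PIPE_SPEED, 0), b.move(-PIPE_SPEED, 0)) for t, b in pipes]
        pipes = [(t, b) for t, b in pipes if t.right > 0]

        # Collision with a pipe or the screen edges resets the run.
        hit = bird.top < 0 or bird.bottom > HEIGHT or any(
            bird.colliderect(t) or bird.colliderect(b) for t, b in pipes
        )
        if hit:
            bird_y, velocity, pipes, spawn_timer = HEIGHT / 2, 0.0, [], 0
            bird.y = int(bird_y)

        # Draw the frame.
        screen.fill((30, 30, 40))
        pygame.draw.rect(screen, (240, 200, 60), bird)
        for t, b in pipes:
            pygame.draw.rect(screen, (80, 200, 120), t)
            pygame.draw.rect(screen, (80, 200, 120), b)
        pygame.display.flip()
        clock.tick(FPS)


if __name__ == "__main__":
    main()
```

Even a skeleton this small exercises the things the verdicts below focus on: spacebar responsiveness, gravity feel, and whether the loop stays stable.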
Models tested
- OpenAI family
- o1
- o3-mini
- o3-mini-high
- o4-mini
- o4-mini-high
- OpenAI's latest model (as of the test period)
- Anthropic (Claude Sonnet) family
- Claude 3.7 Sonnet
- Claude 3.5 Sonnet
- EnAnthropic (Anthropic-style prompt)
- Google Gemini family
- Gemini 2.0 Flash
- Gemini 2.5
- xAI Grok family
- Grok 3 (the sleeper hit)
Observations and quick verdicts
- Gemini 2.0 Flash
- Verdict: rough start, slow and unimpressive
- o1
- Verdict: poor baseline
- o3-mini
- Verdict: underwhelming behavior, gravity and control feel off
- o3-mini-high
- Verdict: notably better; one of the better performers in the run
- Claude 3.7 Sonnet
- Verdict: strong potential, good thinking/adjustments; bugged reset at one point
- Claude 3.5 Sonnet
- Verdict: hit-or-miss; some syntax/implementation issues
- EnAnthropic
- Verdict: mixed results; reliability varied across runs
- Gemini 2.5
- Verdict: mixed-to-average outcomes; not the standout
- OpenAI latest
- Verdict: included in the mix, but not clearly the best in this batch
- o4-mini
- Verdict: fast start, but encountered stability/syntax problems; not consistently reliable
- o4-mini-high
- Verdict: one of the faster, more responsive attempts; still some quirks
- Grok 3
- Verdict: the clear winner in this test; finally delivered a reliable, playable result
- Noted as the “line 129” moment where the implementation aligned cleanly with the prompt
- Final assessment: Grok 3 beat the others for playable performance and stability
The winner and what it means
- Grok 3 emerged as the winner of this Flappy Bird benchmark.
- Why it stood out:
- More stable gravity and collision behavior
- More reliable spacebar responsiveness
- Fewer quirky rendering anomalies compared to other models
- Takeaway: even with a single shared prompt and the same task, model performance can be all over the map. The best result often comes from a model that handles the control flow and timing more predictably, not just raw language quality.
Notable moments and gotchas
- The experiment wasn’t just about accuracy; it was about “playability” and responsiveness in a simple game loop. Some models produced nice text but struggled to drive a real-time Pygame window smoothly.
- Syntax hiccups and code integration issues were common (hence the Copilot mentions). Having a quick fix path can save lots of time but doesn’t change the underlying model performance.
- Visual quirks (bird shapes, velocities) were a reminder that rendering decisions can skew perceived performance even when the logic is correct.
How you can reproduce
- Use a fixed prompt across all models to ensure comparability.
- Create a separate folder per model and place a minimal Pygame-based main.py in each.
- Run python main.py from each model’s folder (or use a small runner script like the one sketched after this list) and observe:
- Responsiveness to spacebar
- Consistency of gravity and bird movement
- Stability of the game loop (no freezes or crashes)
- Grade each model by how playable the result is, not just whether it runs.
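To avoid juggling half a dozen terminals, a small runner script can launch each folder’s main.py in turn. This is a convenience sketch, not part of the original workflow, and the folder names in MODEL_DIRS are assumptions; rename them to match your own layout.

```python
# Convenience runner: launch each model folder's main.py one at a time
# for a hands-on play test. Folder names are placeholder assumptions;
# adjust them to match however you name your per-model directories.
import subprocess
import sys
from pathlib import Path

MODEL_DIRS = ["o1", "o3-mini", "o3-mini-high", "o4-mini", "grok-3"]  # assumed names

for name in MODEL_DIRS:
    folder = Path(name)
    script = folder / "main.py"
    if not script.exists():
        print(f"[skip] {script} not found")
        continue
    print(f"[run] {name}: close the game window to move on to the next model")
    # Run with the folder as the working directory so relative assets resolve.
    result = subprocess.run([sys.executable, "main.py"], cwd=folder)
    print(f"[done] {name}: exit code {result.returncode}")
```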
Actionable takeaways
- When benchmarking AI models on code tasks, prioritize runtime behavior and user experience (e.g., input responsiveness and frame rate stability; a jitter-measurement sketch follows this list) in addition to correctness.
- A single “best” model can be elusive; you may need to test multiple generations within a provider’s lineup to find a playable option.
- Don’t rely on a model’s language quality alone—test its ability to produce real-time, deterministic results in an executable environment.
- Having quick tooling fixes (e.g., Copilot or editor assist) is helpful, but keep the core tests model-centric rather than tool-centric.
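If you want a number for “frame rate stability” rather than an eyeball judgment, one option (my own addition, not part of the original runs) is to log per-frame deltas from Pygame’s clock and report the jitter at the end of a run:

```python
# Rough frame-time stability check: record how long each frame took and
# report the jitter (standard deviation) at the end of a run. Hook it into
# whichever model's game loop you are grading; this helper is a convenience
# I am assuming, not something from the original test setup.
import statistics


class FrameStats:
    def __init__(self):
        self.frame_times_ms = []

    def record(self, clock):
        """Call once per frame, after clock.tick(); pygame's Clock.get_time()
        returns the milliseconds spent on the previous frame."""
        self.frame_times_ms.append(clock.get_time())

    def report(self):
        if len(self.frame_times_ms) < 2:
            print("not enough frames recorded")
            return
        mean = statistics.mean(self.frame_times_ms)
        jitter = statistics.stdev(self.frame_times_ms)
        print(f"frames: {len(self.frame_times_ms)}  "
              f"mean: {mean:.1f} ms  jitter: {jitter:.1f} ms")


# Usage inside a Pygame loop:
#   stats = FrameStats()
#   ... each frame: clock.tick(FPS); stats.record(clock)
#   ... on exit: stats.report()
```

A model whose loop shows low mean frame time and low jitter will usually also feel the most responsive on the spacebar, which is exactly what separated the better runs in this batch.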
Links
- Pygame: https://www.pygame.org
- Visual Studio Code: https://code.visualstudio.com
- GitHub Copilot: https://github.com/features/copilot
- Google Gemini: https://ai.google/discover/gemini
- Anthropic Claude: https://www.anthropic.com/claude
If you want more battle-tested AI benchmarks like this, I’ll run additional rounds and share the breakdowns with a focus on real-time performance and robustness.