Parker Rex · November 5, 2024

Claude Releases New Model (Haiku 3.5) | Code Testing & Impressions

Claude Haiku 3.5: code testing, first impressions, pricing, and benchmarks vs. Gemini and GPT, plus what it means for coders and agents.

Show Notes

Claude Haiku 3.5 is put to the test in Parker's hands-on look at coding and UI prompts, with a practical lens on cost, performance, and real-world workflows. Here are the key takeaways and findings.

Benchmark landscape at a glance

  • Claude Haiku 3.5 performance
    • On the Aider code editing leaderboard: around 4th place at roughly 75% on coding tasks, trailing the top models.
    • Pricing vs. capability: positioned as the cheaper option, but user feedback flags the price as high for a model that isn't the outright leader.
  • Reasoning benchmarks
    • Demonstrates solid chain-of-thought reasoning, but still not on par with the flagship, top-tier models.
    • Compared to Gemini/other premium models, it’s competitive in some aspects but notably costlier.
  • Practical takeaway
    • If you’re optimizing for price-to-performance in coding tasks, Haiku 3.5 is a mixed bag: better than many but expensive for what you get relative to the best-in-class.

Hands-on testing: prompts and results

  • Test setup (summary)
    • Two primary prompts: a pixel-perfect clone prompt (copying an existing component) and a small one-shot web app UI task, pitting Claude 3.5 Sonnet against Claude 3.5 Haiku.
    • The same prompts were run against both models to compare output quality and speed.
  • Test 1: Pixel-perfect clone (component copy)
    • Sonnet: produced more usable, structured results, with clearer code blocks and narrative guidance.
    • Haiku: struggled to get started or render clean output; results were less consistent and harder to translate into workable code.
    • Takeaway: Sonnet tends to be stronger for direct coding tasks that require clean structure and stepwise delivery.
  • Test 2: One-shot web app with an ingredient history UI
    • The prompt asked for a UI that shows how an ingredient's details change over time (a rough sketch of that kind of component follows this list).
    • Sonnet: produced a more complete artifact (code plus architecture notes), while acknowledging gaps that still needed refinement.
    • Haiku: often failed to render a coherent TSX flow or skipped key pieces, making it less reliable for this kind of task.
    • Takeaway: for frontend app generation, Sonnet more consistently delivers usable scaffolds; Haiku lags behind in this test.
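
To make the target of Test 2 concrete, here is a minimal TSX sketch of the kind of ingredient-history component the one-shot prompt asks for. The types, prop names, and Tailwind classes are illustrative assumptions, not output produced by either model in the video.

```tsx
// Hypothetical sketch of the component Test 2 asks for: an ingredient whose
// change history (price, quantity, supplier) is rendered as a simple timeline.
// All names and styles here are illustrative, not the video's actual output.
import React from "react";

type IngredientChange = {
  date: string; // ISO date of the change, e.g. "2024-10-01"
  field: "price" | "quantity" | "supplier";
  previous: string;
  current: string;
};

type IngredientHistoryProps = {
  name: string;
  changes: IngredientChange[];
};

export function IngredientHistory({ name, changes }: IngredientHistoryProps) {
  // Show the newest change first so the latest edit is visible at the top.
  const sorted = [...changes].sort((a, b) => b.date.localeCompare(a.date));

  return (
    <section className="rounded-lg border p-4">
      <h2 className="text-lg font-semibold">{name}: change history</h2>
      <ul className="mt-2 space-y-2">
        {sorted.map((change, i) => (
          <li key={i} className="text-sm">
            <span className="font-mono">{change.date}</span>{" "}
            {change.field}: {change.previous} → {change.current}
          </li>
        ))}
      </ul>
    </section>
  );
}

// Usage example:
// <IngredientHistory
//   name="Olive oil"
//   changes={[{ date: "2024-10-01", field: "price", previous: "$8.99", current: "$9.49" }]}
// />
```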

Prompt engineering and architect-style workflow

  • Architect-focused workflow
    • The idea, borrowed from Aider-style architect practice: start with planning, then define data models, APIs, and the TypeScript/React implementation steps.
    • Example high-level approach (condensed prompt pattern):
      • Act as the software architect
      • Define data models and API endpoints
      • Output a structured plan in Markdown with code blocks and explanations
  • Model behavior differences
    • Sonnet tends to produce clean Markdown with distinct sections, code blocks, and step-by-step guidance.
    • Haiku tends to mix formats and can require more manual post-processing to extract usable code.
  • Practical note
    • If the goal is documentation-to-code (or the reverse), Sonnet's output tends to align better with developer workflows and tooling; a minimal API sketch of this pattern follows this list.
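
To make the pattern concrete, here is a minimal sketch of sending an architect-style system prompt through the official @anthropic-ai/sdk package. The prompt wording is condensed from the bullets above, and the model ID and output handling are assumptions, not the exact setup used in the video.

```ts
// Minimal sketch: architect-first prompt via the Anthropic SDK.
// Assumptions: @anthropic-ai/sdk is installed, ANTHROPIC_API_KEY is set,
// and the model ID below is still current; the prompt text is illustrative.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const ARCHITECT_PROMPT = [
  "Act as the software architect.",
  "Define data models and API endpoints for the feature described by the user.",
  "Output a structured plan in Markdown with code blocks and short explanations.",
  "Do not write the full implementation yet; plan first.",
].join("\n");

async function planFeature(featureDescription: string): Promise<string> {
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-20241022", // swap in the Haiku ID to compare behavior
    max_tokens: 2000,
    system: ARCHITECT_PROMPT,
    messages: [{ role: "user", content: featureDescription }],
  });

  // Join the text blocks of the response into a single Markdown plan.
  return response.content
    .map((block) => (block.type === "text" ? block.text : ""))
    .join("");
}

// Example: ask for a plan for the ingredient-history UI from the tests above.
planFeature("A TypeScript/React app that tracks ingredient history changes over time.")
  .then((plan) => console.log(plan))
  .catch((err) => console.error(err));
```

Swapping the model ID to the Haiku variant and diffing the two Markdown plans is a quick way to see the formatting differences described above.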

Practical workflows and takeaways

  • Low-code evaluation workflow
    • Make.com (formerly Integromat) is a practical way to automate model testing without writing code.
    • Build a workflow that sends repeated prompts, collects the outputs, and stores the results in a spreadsheet for quick comparison.
  • Documentation-ready AI
    • For turning a website or doc into well-formatted Markdown with snippets, you can feed a URL and have the AI generate structured docs, checklists, and examples.
  • Quick-start prompts for coding
    • Use an architect-style prompt to define scope, then drill down into TypeScript/React/Tailwind specifics.
    • Example (minimal prompt snippet):
      • Act as the software architect. Define data models, API endpoints, and a TypeScript/React plan. Output in Markdown with code blocks for key parts.
  • What to test with your team
    • Run side-by-side comparisons on your own coding tasks (component scaffolds, API schemas, UI flows) to see which model handles your typical patterns best; a scripted version of this harness is sketched after this list.
    • Track both quality and speed to judge whether the cost lines up with your productivity gains.
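
For teams comfortable writing a little code instead of (or alongside) a Make.com scenario, here is a minimal sketch of such a side-by-side harness. The model IDs, prompts, and the results.csv path are assumptions for illustration; it is not the exact workflow from the video.

```ts
// Minimal sketch: run the same prompts against Sonnet and Haiku,
// time each call, and append the results to a CSV for side-by-side review.
// Assumptions: @anthropic-ai/sdk installed, ANTHROPIC_API_KEY set,
// model IDs current, and "results.csv" as a hypothetical output file.
import Anthropic from "@anthropic-ai/sdk";
import { appendFileSync } from "node:fs";

const anthropic = new Anthropic();

const MODELS = ["claude-3-5-sonnet-20241022", "claude-3-5-haiku-20241022"];
const PROMPTS = [
  "Scaffold a TypeScript/React component that lists ingredient history changes.",
  "Define a REST API schema for tracking ingredient price changes over time.",
];

async function runOnce(model: string, prompt: string) {
  const start = Date.now();
  const response = await anthropic.messages.create({
    model,
    max_tokens: 1500,
    messages: [{ role: "user", content: prompt }],
  });
  const seconds = (Date.now() - start) / 1000;
  const text = response.content
    .map((block) => (block.type === "text" ? block.text : ""))
    .join("");

  // One CSV row per run: model, latency, output length, and the quoted prompt.
  appendFileSync(
    "results.csv",
    `${model},${seconds.toFixed(1)},${text.length},"${prompt}"\n`
  );
}

async function main() {
  appendFileSync("results.csv", "model,seconds,output_chars,prompt\n");
  for (const prompt of PROMPTS) {
    for (const model of MODELS) {
      await runOnce(model, prompt); // sequential to keep timings comparable
    }
  }
}

main().catch((err) => console.error(err));
```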

Final takeaways

  • Haiku 3.5 is not the top dog on coding benchmarks, but it’s a credible option with strong consistency in some tasks.
  • Sonnet generally outperforms Haiku on coding-focused prompts and architecture-style workflows, often at a better price-to-performance point.
  • If you’re building workflows that rely on repeated AI tasks, consider low-code automation (Make.com) to test and compare models efficiently.
  • Use architecture-first prompts to drive clearer outputs (data models, APIs, and structured steps) before diving into code generation.

Actionable next steps

  • Do a two-model test for your key coding tasks and measure time-to-useful-output versus cost.
  • Set up a Make.com workflow to automate iterative prompts and export results for quick side-by-side analysis.
  • Experiment with an architecture prompt to see how each model formats its plan; use Sonnet's Markdown/code structure to scaffold your project.