Show Notes
Microsoft is integrating AI model experimentation directly into GitHub, a big lever for enterprise AI. Here's what you need to know and how to use it to move from endless prompt tinkering to real, testable results.
What GitHub Models is
- A workspace inside a private GitHub repo that lets you find and experiment with AI models for free.
- You get a versioned, playground-like environment (similar to Vercel's playground) embedded into your GitHub workflows.
- Aims to lower barriers to enterprise-grade AI adoption by tying AI development to familiar GitHub processes.
How it works (quick setup outline)
- In your repo, you'll use a prompt.yaml file to define prompts, models, and parameters (sketched at the end of this section).
- When the repo is private, you get a secure, sandboxed environment for model experimentation.
- A new Models tab in the GitHub UI exposes:
  - Prompt name, description, and the chosen model
  - Parameters and the message stack (system/user messages)
  - A place to store and iterate on prompts directly alongside your code
  - The ability to provide test messages or test inputs
Sample workflow:
- Create and edit prompts with system instructions, user prompts, and test inputs.
- Save prompts, model choices, and parameter settings to prompt.yaml.
- Run side-by-side evaluations of multiple models using identical prompts.
The code blocks and UI support back-and-forth prompt iteration, including a message stack for realistic multi-turn dialog.
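To make that concrete, here is a minimal sketch of a prompt.yaml. The field names follow GitHub's published prompt-file format, but the model identifier and prompt content are placeholders; verify against the current schema before relying on it:

```yaml
# Minimal illustrative prompt.yaml; field names follow GitHub's
# documented prompt-file format, but verify the current schema.
name: Ticket Summarizer
description: Summarizes a support ticket into one sentence
model: openai/gpt-4o-mini        # placeholder model identifier
modelParameters:
  temperature: 0.3               # lower temperature for consistent summaries
messages:
  - role: system
    content: You are a concise assistant. Summarize the user's text in one sentence.
  - role: user
    content: "{{input}}"         # template variable filled from test inputs
```

Because this file lives in the repo, every change to the prompt, model choice, or parameters is an ordinary commit.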
Core features you’ll actually use
- Prompt storage under version control
- Model selection and parameter tuning in one place
- Structured prompts with system instructions, test inputs, and variables
- Model comparison: run multiple models side by side with the same prompts and inputs
- Evaluators: scoring metrics like similarity, relevance, and groundedness to analyze outputs
- Prompt, model, and parameter settings saved together in a single file (prompt.yaml)
- Private repo safety: work remains in your org, with governance around model access
Evaluators and metrics
- Built-in evaluators let you score outputs to guide cost and quality:
  - Similarity: how closely output matches expected ideas
  - Relevance: alignment with the task
  - Groundedness: factual alignment with inputs
- Use these scores to:
  - Decide which models and prompts to adopt
  - Optimize prompts for the best return
  - Control costs by preferring cheaper models when quality is acceptable
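As a sketch, evaluators can be declared in the same prompt.yaml. The github/similarity identifier below follows GitHub's documented built-in evaluators, but treat the exact names as assumptions to verify:

```yaml
# Illustrative evaluators block; confirm built-in evaluator names
# against the current GitHub Models documentation before use.
evaluators:
  - name: similarity             # LLM-scored closeness to the expected answer
    uses: github/similarity
  - name: mentions-refund        # cheap deterministic check on the raw output
    string:
      contains: refund
```

Mixing a cheap string check with an LLM-scored metric like this lets you catch obvious failures before spending on model-graded evaluation.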
Collaboration, governance, and admin controls
- Organization admins can allow all models or create an allow/deny list
- Makes cross-team prompt sharing practical while keeping governance tight
- Great for coordinating prompts across large teams and ensuring consistency
Tooling and the bigger picture
- GitHub Models is positioned to work with tool calling in Copilot:
  - API/tool calls in the generative stack (coming via Microsoft's tooling)
  - The possibility of defining personal tools or company-specific toolsets within Copilot
  - Plans around open-source Copilot concepts that could enable custom tool integration
- Practical implication: you could ship your own tools that Copilot can call, all managed inside your private repo
Note: While some of these tooling capabilities are still evolving, the direction is clearly toward embedded tool calls, extensible prompts, and self-contained tool ecosystems inside GitHub.
UI/Workflow highlights
- The models workspace shows a side-by-side comparison UI for models and prompts
- You can commit prompt changes like you would code changes, keeping prompts traceable
- The models page exposes code snippets you can drop into projects, enabling quick adoption
- Test data can be embedded in prompt.yaml so samples exist alongside your prompts
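For example, test rows can live in the same prompt.yaml next to the prompt itself (illustrative; each input feeds the {{input}} template variable shown earlier, and expected is what evaluators compare against):

```yaml
# Illustrative testData block embedded in prompt.yaml; each row is
# one sample run, with "expected" used by the evaluators.
testData:
  - input: "Order #123 arrived damaged and the customer wants a refund."
    expected: "Customer reports a damaged order and requests a refund."
  - input: "User cannot log in after the latest password reset."
    expected: "User is locked out following a password reset."
```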
Real-world takeaways
- If you’re building prompts for products, GitHub Models lets you iterate faster with real evaluation metrics (no more guesswork)
- You can compare multiple models with identical prompts to see which one meets your needs at a given cost
- The ability to save prompts, models, and settings in a single YAML file makes it easier to version and share within teams
- Admin controls help you manage risk and keep teams aligned on which models are allowed
Practical tips and cautions
- Start simple (KISS): use a minimal prompt.yaml to get a feel for iteration before scaling
- Leverage evaluators early to quantify gains and justify costs
- Weigh privacy carefully: even in private repos, consider what data is used for training or feedback
- Expect the tooling to evolve quickly — stay flexible and keep prompts modular
Getting started (actionable steps)
- Create a private repo or use an existing one and enable the GitHub Models workspace
- Add a prompt.yaml with a basic prompt, a chosen model, and simple test inputs
- Use the Models tab to create side-by-side experiments (e.g., GPT-4 vs. another model)
- Enable evaluators (similarity, relevance, groundedness) and compare results
- Save promising combinations and incrementally add more prompts, tests, and parameters
- Explore tool-call concepts and watch for upcoming Copilot capabilities that let you add personal tools
Takeaways
- GitHub Models turns prompt experimentation into a first-class, version-controlled workflow
- You can compare models, use evaluators, and iterate quickly within GitHub
- Governance and private repos help teams adopt AI safely at scale
- This approach aligns with the broader trend toward embedded tool calls and customizable copilots