The Monitoring Workflow
Production skill monitoring follows a four-part process:

- Logging - Record agent behavior during skill execution
- Evaluating - Measure performance using relevant metrics
- Dashboarding - Visualize metrics over time
- Aggregating - Use feedback to improve the skill
Logging Agent Behavior
OpenHands includes OpenTelemetry-compatible instrumentation via the Laminar library. Set up logging to capture agent traces during skill execution.

For SDK Users
Set the LMNR_PROJECT_API_KEY environment variable to send traces to Laminar, or configure any OpenTelemetry-compatible backend:
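For example, in a shell environment (the Laminar key name comes from the text above; the OTLP endpoint variable is the standard OpenTelemetry convention, shown here as an illustrative alternative):

```shell
# Option 1: send traces to Laminar's hosted backend
export LMNR_PROJECT_API_KEY="<your-laminar-project-key>"

# Option 2: point at any OpenTelemetry-compatible collector
# via the standard OTLP environment variable
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
```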
For GitHub Actions
When using skills in GitHub workflows, add the API key to your action configuration. See the PR review action example for reference.

Evaluating Performance
Define metrics that reflect whether your skill is working correctly. Effective metrics measure actual outcomes rather than intermediate steps.

Example: PR Review Skill
For a code review skill, measure suggestion acceptance rate:

- Number of suggestions made by the agent
- Number of suggestions incorporated by developers
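The metric itself is just the ratio of the two counts. A minimal sketch (the function name and inputs are illustrative, not part of the OpenHands SDK):

```python
def suggestion_acceptance_rate(made: int, accepted: int) -> float:
    """Fraction of agent suggestions that developers actually incorporated."""
    if made == 0:
        return 0.0  # no suggestions: define the rate as 0 to avoid division by zero
    return accepted / made

# e.g. 12 suggestions made, 9 incorporated by developers
rate = suggestion_acceptance_rate(made=12, accepted=9)
print(f"acceptance rate: {rate:.0%}")  # prints "acceptance rate: 75%"
```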
Implementation Approach
- Create an evaluation workflow - Run after the main task completes (e.g., after PR merge)
- Collect relevant data - Agent output, human responses, final results
- Use LLM as judge - Feed data into a prompt that calculates metrics
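The LLM-as-judge step can be sketched as a prompt-assembly function: the collected run data is packed into one prompt that asks the judge to emit the metric counts as JSON. Everything here (function name, prompt wording, JSON schema) is an illustrative assumption, not an OpenHands API:

```python
import json

def build_judge_prompt(agent_output: str,
                       human_responses: list[str],
                       final_result: str) -> str:
    """Assemble collected run data into a single LLM-as-judge prompt.

    Asking for JSON output lets the metric be parsed mechanically afterward.
    """
    return (
        "You are evaluating a code-review skill.\n\n"
        f"Agent suggestions:\n{agent_output}\n\n"
        f"Developer responses:\n{json.dumps(human_responses, indent=2)}\n\n"
        f"Final merged result:\n{final_result}\n\n"
        "Count how many suggestions were incorporated and respond with JSON: "
        '{"suggestions_made": <int>, "suggestions_accepted": <int>}'
    )
```

The returned string would then be sent to whatever model backend you use for evaluation.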
Dashboarding Metrics
Visualize metrics over time to identify trends. With Laminar or similar platforms, create SQL queries that aggregate evaluation results. Track:

- Metric trends (improving or degrading)
- Performance across different contexts (repos, file types, etc.)
- Comparison between prompt variations or models
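The same aggregation can be prototyped in plain Python before committing to dashboard SQL. A sketch with made-up records, grouping acceptance rates by repository to compare contexts:

```python
from collections import defaultdict
from datetime import date

# Each evaluation run: (run date, repo, acceptance rate) -- illustrative data
runs = [
    (date(2024, 5, 1), "backend", 0.60),
    (date(2024, 5, 8), "backend", 0.70),
    (date(2024, 5, 1), "frontend", 0.40),
    (date(2024, 5, 8), "frontend", 0.50),
]

# Group rates per repo, then report the mean for each context
by_repo: dict[str, list[float]] = defaultdict(list)
for _, repo, rate in runs:
    by_repo[repo].append(rate)

for repo, rates in sorted(by_repo.items()):
    print(f"{repo}: mean acceptance rate {sum(rates) / len(rates):.2f}")
```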
Aggregating Feedback for Improvement
Use language models to analyze patterns in evaluation results and suggest skill improvements.

Process
- Collect evaluation data - Aggregate analyses from recent runs
- Provide current skill content - Include the existing SKILL.md
- Use a reasoning model - Feed both into a long-context reasoning model (e.g., Gemini 2 Pro or Claude 3.5 Sonnet)
- Extract actionable suggestions - Review model output for concrete improvements
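The steps above amount to combining the current SKILL.md with recent evaluation analyses into one long-context prompt. A sketch (the function and prompt wording are assumptions, not an OpenHands API):

```python
from pathlib import Path

def build_improvement_prompt(skill_path: str, evaluations: list[str]) -> str:
    """Combine the current SKILL.md with recent evaluation analyses into a
    single prompt asking a reasoning model for concrete skill edits."""
    skill_md = Path(skill_path).read_text()
    joined = "\n---\n".join(evaluations)
    return (
        "Here is the current skill definition:\n\n"
        f"{skill_md}\n\n"
        "Here are analyses of recent evaluation runs:\n\n"
        f"{joined}\n\n"
        "Suggest specific, actionable edits to the skill that would address "
        "recurring failure patterns."
    )
```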
Example Output
Example output from aggregation:

Deployment in Automated Workflows
Skills can run automatically in CI/CD pipelines. The OpenHands Extensions repository includes example GitHub Actions for common automation patterns.

Common Automation Use Cases
- PR review - Run code review skills when PRs are marked “ready for review”
- Issue triage - Classify and label new issues
- Code generation - Generate boilerplate or documentation
- Security scanning - Check for vulnerabilities and suggest fixes
Best Practices
Choose Meaningful Metrics
Select metrics that reflect real-world outcomes, not just intermediate steps.

Good metrics:
- Suggestion acceptance rate (for code review)
- Issue classification accuracy (for triage)
- Time to resolution (for bug fixing)

Poor metrics:
- Number of suggestions made
- Lines of code generated
- Tokens consumed
Start Simple
Begin with basic logging before implementing complex evaluation pipelines.
- Set up OpenTelemetry logging
- Review traces manually to understand agent behavior
- Identify patterns in successes and failures
- Design metrics based on observed patterns
- Automate evaluation
Iterate on Skills Based on Data
Use evaluation results to make targeted improvements:
- Low accuracy → Review skill instructions for clarity
- Inconsistent behavior → Add more specific examples
- Context errors → Expand references/ with domain knowledge
- Repetitive failures → Create scripts for deterministic tasks
Monitor Multiple Dimensions
Track performance across different contexts:
- By repository - Different repos may need different approaches
- By file type - Skills may work better on certain languages
- By time - Identify degradation or improvement trends
- By model - Compare different LLM backends
Further Reading
- SDK Observability Guide - Detailed OpenTelemetry configuration
- GitHub Workflows - Automate skills in CI/CD
- Hooks Guide - Event-driven skill execution
- Creating Skills - Skill creation fundamentals

