After creating and deploying a skill, monitor its performance to ensure it works correctly in production. This is particularly important for skills used in automated workflows like CI/CD pipelines.

The Monitoring Workflow

Production skill monitoring follows a four-part process:
  1. Logging - Record agent behavior during skill execution
  2. Evaluating - Measure performance using relevant metrics
  3. Dashboarding - Visualize metrics over time
  4. Aggregating - Use feedback to improve the skill

Logging Agent Behavior

OpenHands includes OpenTelemetry-compatible instrumentation via the Laminar library. Set up logging to capture agent traces during skill execution.

For SDK Users

Set the LMNR_PROJECT_API_KEY environment variable to send traces to Laminar, or configure any OpenTelemetry-compatible backend:
export LMNR_PROJECT_API_KEY="your-api-key"
See the SDK Observability Guide for detailed configuration options including Honeycomb, Jaeger, Datadog, and other OTLP-compatible backends.
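The backend-selection logic can be sketched as follows. This is an illustrative helper, not part of the OpenHands SDK; the returned dict shape is an assumption, though `LMNR_PROJECT_API_KEY` comes from above and `OTEL_EXPORTER_OTLP_ENDPOINT` is the standard OpenTelemetry exporter variable:

```python
import os

def tracing_backend() -> dict:
    """Decide where agent traces should go based on the environment."""
    if os.environ.get("LMNR_PROJECT_API_KEY"):
        # Laminar is used when its project key is present.
        return {"backend": "laminar"}
    # Otherwise fall back to any OTLP-compatible collector
    # (Honeycomb, Jaeger, Datadog, a local collector, ...).
    endpoint = os.environ.get(
        "OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"
    )
    return {"backend": "otlp", "endpoint": endpoint}
```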

For GitHub Actions

When using skills in GitHub workflows, add the API key to your action configuration. See the PR review action example for reference.

Evaluating Performance

Define metrics that reflect whether your skill is working correctly. Effective metrics measure actual outcomes rather than intermediate steps.

Example: PR Review Skill

For a code review skill, measure suggestion acceptance rate:
suggestion_accuracy = ai_suggestions_reflected / ai_suggestions
Track:
  • Number of suggestions made by the agent
  • Number of suggestions incorporated by developers
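The metric above can be computed from those two counts. A minimal sketch (the function name is illustrative), with a guard for runs where the agent made no suggestions:

```python
def suggestion_accuracy(reflected: int, total: int) -> float:
    """suggestion_accuracy = ai_suggestions_reflected / ai_suggestions,
    defined as 0.0 when the agent made no suggestions at all."""
    return reflected / total if total else 0.0
```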

Implementation Approach

  1. Create an evaluation workflow - Run after the main task completes (e.g., after PR merge)
  2. Collect relevant data - Agent output, human responses, final results
  3. Use LLM as judge - Feed data into a prompt that calculates metrics
Example evaluation prompt excerpt:
### ai_suggestions
Count items where the body contains an actionable code suggestion
(look for code blocks, "suggestion:", specific changes to make).
Do NOT count general praise or approval-only comments.

### ai_suggestions_reflected
Count suggestions that were incorporated. A suggestion is "reflected" if:
1. A human response indicates the suggestion was implemented, OR
2. The suggestion appears in the final diff
See the evaluation action example for a complete implementation.
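Assuming the judge prompt instructs the model to reply with a JSON object containing both counts, parsing its reply and deriving the metric might look like this (function and field names are illustrative, not part of any OpenHands API):

```python
import json

def parse_judge_output(raw: str) -> dict:
    """Parse the LLM judge's JSON reply and attach the derived metric."""
    counts = json.loads(raw)
    total = counts["ai_suggestions"]
    reflected = counts["ai_suggestions_reflected"]
    # Guard against runs where the agent made no suggestions.
    counts["suggestion_accuracy"] = reflected / total if total else 0.0
    return counts
```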

Dashboarding Metrics

Visualize metrics over time to identify trends. With Laminar or similar platforms, create SQL queries that aggregate evaluation results. Track:
  • Metric trends (improving or degrading)
  • Performance across different contexts (repos, file types, etc.)
  • Comparison between prompt variations or models
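As a sketch of the trend aggregation, the following groups per-run results into a time series; the run-record fields ("week", "suggestion_accuracy") are assumptions for illustration, not a Laminar schema:

```python
from collections import defaultdict
from statistics import mean

def weekly_accuracy(runs: list[dict]) -> dict[str, float]:
    """Average suggestion_accuracy per ISO week, for plotting over time."""
    by_week: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        by_week[run["week"]].append(run["suggestion_accuracy"])
    # Sorted so the dashboard reads left-to-right in time order.
    return {week: mean(vals) for week, vals in sorted(by_week.items())}
```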

Aggregating Feedback for Improvement

Use language models to analyze patterns in evaluation results and suggest skill improvements.

Process

  1. Collect evaluation data - Aggregate analyses from recent runs
  2. Provide current skill content - Include the existing SKILL.md
  3. Use a reasoning model - Feed both into a long-context model (Gemini-2-Pro, Claude 3.5 Sonnet, etc.)
  4. Extract actionable suggestions - Review model output for concrete improvements
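Steps 1–3 amount to assembling one long prompt. A minimal sketch, with the function name and section headers as illustrative choices; the result would be fed to whichever long-context model you use:

```python
def build_improvement_prompt(skill_md: str, evaluations: list[str]) -> str:
    """Combine the current SKILL.md with recent evaluation analyses
    into a single prompt asking for concrete improvements."""
    joined = "\n---\n".join(evaluations)
    return (
        "You maintain the skill defined below. Based on the recent "
        "evaluation analyses, suggest concrete, actionable improvements.\n\n"
        f"## Current SKILL.md\n{skill_md}\n\n"
        f"## Recent evaluation analyses\n{joined}\n"
    )
```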

Example Output

Example output from aggregation:
### Issue: Context-Unaware Suggestions
The agent suggests technically correct changes that conflict with
repository conventions (e.g., suggesting integration tests when the
repo uses mocks).

Frequency: ~15% of suggestions
Recommendation: Add repo-specific testing philosophy to references/

Deployment in Automated Workflows

Skills can run automatically in CI/CD pipelines. The OpenHands Extensions repository includes example GitHub Actions for common automation patterns.

Common Automation Use Cases

  • PR review - Run code review skills when PRs are marked “ready for review”
  • Issue triage - Classify and label new issues
  • Code generation - Generate boilerplate or documentation
  • Security scanning - Check for vulnerabilities and suggest fixes
See the GitHub Workflows guide for SDK-based automation examples.

Best Practices

Select metrics that reflect real-world outcomes, not just intermediate steps.

Good metrics:
  • Suggestion acceptance rate (for code review)
  • Issue classification accuracy (for triage)
  • Time to resolution (for bug fixing)
Poor metrics:
  • Number of suggestions made
  • Lines of code generated
  • Tokens consumed

Begin with basic logging before implementing complex evaluation pipelines:
  1. Set up OpenTelemetry logging
  2. Review traces manually to understand agent behavior
  3. Identify patterns in successes and failures
  4. Design metrics based on observed patterns
  5. Automate evaluation

Use evaluation results to make targeted improvements:
  • Low accuracy → Review skill instructions for clarity
  • Inconsistent behavior → Add more specific examples
  • Context errors → Expand references/ with domain knowledge
  • Repetitive failures → Create scripts for deterministic tasks

Track performance across different contexts:
  • By repository - Different repos may need different approaches
  • By file type - Skills may work better on certain languages
  • By time - Identify degradation or improvement trends
  • By model - Compare different LLM backends
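Such per-context breakdowns can reuse the same two counts logged per run. A sketch with hypothetical run records, grouping by any context key ("repo", "file_type", "model", ...) and pooling the counts before dividing:

```python
from collections import defaultdict

def accuracy_by(runs: list[dict], key: str) -> dict[str, float]:
    """Pooled suggestion accuracy per value of the given context key."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for run in runs:
        totals[run[key]][0] += run["ai_suggestions_reflected"]
        totals[run[key]][1] += run["ai_suggestions"]
    # Divide pooled counts so large runs weigh more than small ones.
    return {k: (r / t if t else 0.0) for k, (r, t) in totals.items()}
```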

Further Reading