Tommy-yw/RunbookHermes

372

+372/day

Python

Hermes-native AIOps agent for evidence-driven incident response, approval-gated remediation, and runbook learning.

From the README

RunbookHermes

Hermes-native AIOps Agent for payment incident response, evidence-driven root-cause analysis, approval-gated remediation, and runbook learning.

RunbookHermes adapts the official Hermes Agent runtime into a production-oriented incident-response system. It keeps Hermes Agent's strengths (runtime loop, provider routing, tool system, memory, context engine, skills, gateway, and safety boundaries) and specializes them for AIOps workflows such as payment-system failures, observability evidence collection, approval, checkpoint, rollback, recovery verification, and runbook knowledge accumulation.

RunbookHermes is not a separate toy dashboard beside Hermes Agent. It is a Hermes-native vertical extension: Hermes provides the agent foundation; RunbookHermes adds the incident-response domain layer.

Product Screenshots

The screenshots below show the current RunbookHermes Web Console. Put these images under docs/assets/ and keep the file names consistent with the Markdown paths.

AIOps Console Overview

The overview page shows the high-level AIOps control plane: incident count, pending approvals, generated skills, critical services, recommended operation flow, current capability boundaries, and a live monitoring preview.

Realtime Monitoring System

The monitoring page provides a multi-dimensional service health view for payment-service, coupon-service, and order-service, including HTTP status signals, QPS, p95 latency, service topology, backend mode, and deployment state.

The lower section of the monitoring page shows log signals and trace signals. This is where RunbookHermes connects observability data to incident diagnosis instead of relying only on model guesses.

Incident Command Center

The incident list page normalizes incidents created from Web, Alertmanager, Feishu, WeCom, or API entry points. It shows service, status, severity, root cause, creation time, and quick incident creation actions.
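As a rough illustration of what normalizing incidents from several entry points could look like, here is a minimal sketch. The `Incident` dataclass, the `normalize` function, the lowercase source keys, and the payload shapes are all hypothetical and are not taken from the RunbookHermes code base. In particular, the Alertmanager branch assumes a single alert object with a `labels` mapping.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional

# Entry points named in the README; the lowercase keys are assumed spellings.
SOURCES = {"web", "alertmanager", "feishu", "wecom", "api"}


@dataclass(frozen=True)
class Incident:
    """Hypothetical normalized record holding the columns the list page shows."""
    source: str
    service: str
    status: str
    severity: str
    root_cause: Optional[str]
    created_at: datetime


def normalize(source: str, payload: dict[str, Any]) -> Incident:
    """Map a source-specific payload onto the shared Incident shape."""
    if source not in SOURCES:
        raise ValueError(f"unknown entry point: {source}")
    if source == "alertmanager":
        # Assumption: service and severity arrive as alert labels.
        fields = payload.get("labels", {})
    else:
        # Assumption: other entry points send flat fields.
        fields = payload
    return Incident(
        source=source,
        service=fields.get("service", "unknown"),
        status="open",      # new incidents start open (assumed status name)
        severity=fields.get("severity", "unknown"),
        root_cause=None,    # filled in later by diagnosis
        created_at=datetime.now(timezone.utc),
    )
```

Whatever the entry point, downstream pages then only need to understand one record shape.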

Incident Detail: Evidence and Executive Summary

The incident detail page displays evidence cards from metrics, logs, and traces, plus an executive summary with root cause, recommended action, evidence IDs, confidence, and approval status.
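The relationship between evidence cards and the executive summary can be sketched with two small data structures. The class names, field names, and the 0 to 1 confidence range below are assumptions made for illustration; the README only lists which pieces of information the page displays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceCard:
    """Hypothetical evidence card; kind is "metrics", "logs", or "traces"."""
    evidence_id: str
    kind: str
    summary: str


@dataclass(frozen=True)
class ExecutiveSummary:
    """Hypothetical summary carrying the fields named in the README."""
    root_cause: str
    recommended_action: str
    evidence_ids: list[str]
    confidence: float
    approval_status: str


def build_summary(
    cards: list[EvidenceCard],
    root_cause: str,
    recommended_action: str,
    confidence: float,
    approval_status: str = "pending",
) -> ExecutiveSummary:
    """Build a summary that cites every supporting card by ID."""
    # Assumption: confidence is expressed as a fraction between 0 and 1.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return ExecutiveSummary(
        root_cause=root_cause,
        recommended_action=recommended_action,
        evidence_ids=[card.evidence_id for card in cards],
        confidence=confidence,
        approval_status=approval_status,
    )
```

Keeping the evidence IDs on the summary lets a reader trace each conclusion back to the metric, log, or trace that supports it.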

Incident Detail: Root Cause and Model-Assisted Summary

The root-cause tab separates deterministic evidence from optional model-assisted explanation. The model summary is only enabled when a model provider is configured.
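One way to keep the two kinds of output apart is to treat the model summary as an optional field that stays empty unless a provider is supplied. The function name `root_cause_view` and the callable-based provider interface are hypothetical; the actual provider routing lives in Hermes Agent and is not shown here.

```python
from typing import Callable, Optional


def root_cause_view(
    findings: list[str],
    summarize: Optional[Callable[[list[str]], str]] = None,
) -> dict:
    """Return deterministic findings plus an optional model-written summary.

    `summarize` stands in for a configured model provider (an assumption);
    when it is None, the view contains evidence-derived findings only.
    """
    view = {"deterministic": list(findings), "model_summary": None}
    if summarize is not None:
        view["model_summary"] = summarize(findings)
    return view
```

With no provider configured the tab would still render, because the deterministic part never depends on the model call.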

Incident Detail: Actions, Approvals, and Checkpoints

Risky actions are not executed blindly. RunbookHermes places write or destructive actions behind approval, checkpoint, dry-run, controlled execution, and recovery verification.
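The gating sequence described above can be sketched as a single function that refuses to proceed without approval, defaults to a dry run, and verifies recovery after execution. Everything here is hypothetical: the `Action` class, the `execute` function, the step labels, and the choice to roll back when verification fails. The checkpoint step is a placeholder that only records a label.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    """Hypothetical remediation action with verification and rollback hooks."""
    name: str
    risky: bool
    run: Callable[[], None]
    verify: Callable[[], bool]
    rollback: Callable[[], None]


def execute(action: Action, approved: bool, dry_run: bool = True) -> list[str]:
    """Walk an action through approval, checkpoint, dry-run, run, and verify."""
    steps: list[str] = []
    # Gate 1: risky (write or destructive) actions need an explicit approval.
    if action.risky and not approved:
        steps.append("blocked: approval required")
        return steps
    # Gate 2: record a checkpoint before touching anything (placeholder only).
    steps.append("checkpoint created")
    # Gate 3: dry-run is the default, so nothing changes unless asked.
    if dry_run:
        steps.append("dry-run only")
        return steps
    action.run()
    steps.append("executed")
    # Gate 4: confirm recovery; roll back if it cannot be confirmed (assumed policy).
    if action.verify():
        steps.append("recovery verified")
    else:
        action.rollback()
        steps.append("rolled back to checkpoint")
    return steps
```

A caller would typically invoke `execute` once with `dry_run=True` to preview the plan and again with `dry_run=False` after a human has approved it.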

Incident Detail: Timeline

The timeline records the full incident lifecycle, including incident creation, evidence collection, hypothesis generation, action planning, checkpoint creation, approval request, approval decision, skill generation, and execution result.
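The lifecycle stages listed above map naturally onto an append-only event log. The sketch below is illustrative only: the `EventType` enum mirrors the stages named in the README, but the identifiers, the `Timeline` class, and the tuple layout are assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventType(Enum):
    """Lifecycle stages named in the README (identifier spellings assumed)."""
    INCIDENT_CREATED = "incident_created"
    EVIDENCE_COLLECTED = "evidence_collected"
    HYPOTHESIS_GENERATED = "hypothesis_generated"
    ACTION_PLANNED = "action_planned"
    CHECKPOINT_CREATED = "checkpoint_created"
    APPROVAL_REQUESTED = "approval_requested"
    APPROVAL_DECIDED = "approval_decided"
    SKILL_GENERATED = "skill_generated"
    EXECUTION_RESULT = "execution_result"


@dataclass
class Timeline:
    """Hypothetical append-only log of (timestamp, type, detail) entries."""
    events: list[tuple[datetime, EventType, str]] = field(default_factory=list)

    def record(self, event_type: EventType, detail: str = "") -> None:
        """Append an event stamped with the current UTC time."""
        self.events.append((datetime.now(timezone.utc), event_type, detail))

    def types(self) -> list[EventType]:
        """Return the event types in the order they were recorded."""
        return [event_type for _, event_type, _ in self.events]
```

Because entries are only ever appended, the log doubles as an audit trail for who approved what and when.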

Incident Detail: Generated Runb
