SP Oncall: Multi-Agent Network Investigation

SP Oncall is an experimental AI-driven network investigation system for Service Provider (SP) networks. It automates network diagnostics and troubleshooting by analyzing device state, identifying issues, and generating detailed root-cause reports. I mostly use it to learn about, and demo, AI solutions for networking.

What Does It Do?

Think of SP Oncall as a team of specialized AI agents that work together to investigate network problems:

Input Validator — Understands the incoming query or alert and decides which devices belong to the investigation.
Context Investigation — Investigates related devices first, such as neighbors or surrounding topology, to build supporting context.
Primary Investigation — Investigates the main target devices using the context-phase findings.
RCA Assessor — Reviews the completed investigation phases and extracts the most likely root cause.
Report Generator — Produces the final human-readable report and updates per-device history.

Architecture

The graph is a linear orchestration pipeline: input_validator_node → context_investigation → primary_investigation → rca_assessor_node → report_generator. Each investigation phase is a sub-graph that handles its own retries internally.

Inside each investigation sub-graph, you'll see four internal nodes in LangGraph Studio: plan_device (creates the investigation strategy), execute_device (queries network devices), collect_device_result (aggregates findings), and assess_device (evaluates whether the objective is met). If the assessment fails and retries remain, the phase loops back to execute_device.
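The retry loop inside a phase can be sketched in plain Python. This is only an illustration of the control flow described above, not the project's LangGraph code: the DeviceInvestigation class, the run_device_investigation function, and the execute/assess callables are hypothetical names, and it assumes that max_retries counts retries after the first attempt.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeviceInvestigation:
    """Hypothetical per-device state; the real state schema may differ."""
    device: str
    plan: str = ""
    findings: list[str] = field(default_factory=list)
    objective_met: bool = False
    attempts: int = 0


def run_device_investigation(
    device: str,
    execute: Callable[[DeviceInvestigation], str],
    assess: Callable[[DeviceInvestigation], bool],
    max_retries: int = 3,
) -> DeviceInvestigation:
    state = DeviceInvestigation(device=device)
    state.plan = f"investigate {device}"        # plan_device
    while True:
        state.attempts += 1
        result = execute(state)                 # execute_device
        state.findings.append(result)           # collect_device_result
        state.objective_met = assess(state)     # assess_device
        retries_used = state.attempts - 1
        # Assumption: stop when the objective is met or no retries remain.
        if state.objective_met or retries_used >= max_retries:
            return state
```

Under that assumption, a device whose assessment never passes is executed at most max_retries + 1 times before the phase moves on.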

Key Features

Optional alert integration: Connect to an external observability stack (like xrd-observability-stack) for automated investigations triggered by network alerts.
Per-device memory: Device Profiles store role, BGP AS, neighbors, and topology facts across runs, plus last alert and health status — all in the LangGraph Store, no external DB required.
Two-phase investigations: Context phase investigates related devices (neighbors, topology) first; primary phase investigates target devices using context findings.
Multi-device concurrency: Multiple devices are investigated in parallel within each phase.
Internal retry loop: Each phase can retry up to max_retries times (default: 3) before moving on.
Skill-based planning: Investigation strategies live in skills/ as Markdown files. Manual queries use all skills; alert-triggered investigations filter by event type.
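To make the multi-device concurrency feature concrete, here is a minimal sketch using asyncio.gather. How SP Oncall actually fans out work inside LangGraph is not described in this README, so treat investigate_device and run_phase as hypothetical stand-ins.

```python
import asyncio


async def investigate_device(device: str) -> dict:
    """Hypothetical placeholder for one device's investigation."""
    await asyncio.sleep(0.01)  # stands in for network queries
    return {"device": device, "status": "investigated"}


async def run_phase(devices: list[str]) -> list[dict]:
    # All devices in the phase are investigated concurrently;
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(investigate_device(d) for d in devices))


if __name__ == "__main__":
    for result in asyncio.run(run_phase(["xrd-1", "xrd-2", "xrd-3"])):
        print(result)
```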

Prerequisites

Before you can use SP Oncall, you'll need these tools installed on your system:

Make — A build automation tool that helps run common commands (install via your package manager).
uv — A fast Python package manager (alternative to pip).
OpenAI API Key — Required if using OpenAI models (default). OpenRouter is also supported.
LangSmith Account — For LangGraph Studio.
Network Devices — Your actual network equipment, or use the DevNet XRd Sandbox for testing.
gNMIBuddy — A gNMI MCP server that provides a simple interface to query network devices. SP Oncall uses it to interact with network devices.

Windows users: This project requires a Unix-like environment. Install WSL (Windows Subsystem for Linux) to run it on Windows.

Quick Start

1. Clone and install

git clone https://github.com/jillesca/sp_oncall
cd sp_oncall
make install

2. Configure environment

Copy .env.example to .env and fill in the required values:

cp .env.example .env

Required keys:

| Variable | Description |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI API key |
| `LANGSMITH_API_KEY` | LangSmith API key (for tracing) |
| `LANGSMITH_PROJECT` | LangSmith project name (e.g. `sp_oncall`) |
| `LANGSMITH_TRACING` | Set to `true` to enable tracing |

See the Configuration Reference below for all available options.
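If you want a quick sanity check before starting, a small script along these lines can report which of the required keys are unset. It is a sketch, not part of the project: missing_keys is a hypothetical helper, and it inspects the process environment, so it assumes the values from .env have already been exported or loaded.

```python
import os
from typing import Mapping, Optional

REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
    "LANGSMITH_TRACING",
)


def missing_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required keys that are absent or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing required keys:", ", ".join(missing))
    else:
        print("All required keys are set.")
```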

3. Configure network device access

SP Oncall uses the gNMIBuddy MCP server to query network devices. Point mcp_config.json at your running gNMIBuddy instance:

{
  "gNMIBuddy": {
    "transport": "http",
    "url": "http://localhost:8000/mcp"
  }
}
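A config in this shape can be checked with a few lines of Python before starting the server. This is a hypothetical helper (http_servers is not part of SP Oncall), and it only assumes what the example above shows: each entry has a transport and, for http, a url.

```python
import json
from urllib.parse import urlparse


def http_servers(config_text: str) -> dict[str, str]:
    """Map server name to URL for every entry that uses the http transport."""
    config = json.loads(config_text)
    servers: dict[str, str] = {}
    for name, entry in config.items():
        if entry.get("transport") != "http":
            continue  # other transports are outside this sketch
        parsed = urlparse(entry.get("url", ""))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"{name}: invalid or missing url")
        servers[name] = entry["url"]
    return servers


if __name__ == "__main__":
    with open("mcp_config.json", encoding="utf-8") as handle:
        print(http_servers(handle.read()))
```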

4. Start

make run

This starts the LangGraph development server. Open LangGraph Studio at the URL shown in the terminal.

5. Send a query

In LangGraph Studio, start a new thread and type a query:

Check BGP neighbors on xrd-1
How are my PE routers performing?
Investigate all core P devices

For the optional alert-driven companion flow, see Optional Observability Integration below.

Testing with DevNet Sandbox

Don't have network devices? No problem! Use the DevNet XRd Sandbox — a free environment for testing.

Sandbox Setup

Reserve the DevNet XRd Sandbox (free account required).
Follow the sandbox instructions to start the containerized SR MPLS network using Docker.
Configure gNMI on the simulated devices.

To automatically configure gNMI on the DevNet XRd Sandbox, run this helper script:

ANSIBLE_HOST_KEY_CHECKING=False \
bash -c 'TMPDIR=$(mktemp -d) \
&& trap "rm -rf $TMPDIR" EXIT \
&& curl -s https://raw.githubusercontent.com/jillesca/gNMIBuddy/refs/heads/main/ansible-helper/xrd_apply_config.yaml > "$TMPDIR/playbook.yaml" \
&& curl -s https://raw.githubusercontent.com/jillesca/gNMIBuddy/refs/heads/main/ansible-helper/hosts > "$TMPDIR/hosts" \
&& uvx --from "ansible-core==2.19.2" --with "paramiko,ansible" ansible-playbook "$TMPDIR/playbook.yaml" -i "$TMPDIR/hosts"'

If you have problems with Ansible

You can manually enable gNMI on each XRd device. Apply this configuration to all XRd devices:

grpc
 port 57777
 no-tls

Don't forget to commit your changes to XRd.
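After committing, you may want to confirm that the gRPC port is reachable from the machine running gNMIBuddy. The sketch below only tests TCP connectivity to port 57777; it does not speak gNMI. The grpc_port_open helper and the example address are placeholders, not part of the project.

```python
import socket


def grpc_port_open(host: str, port: int = 57777, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Replace with the management address of one of your XRd devices.
    host = "198.51.100.10"
    state = "open" if grpc_port_open(host) else "closed or unreachable"
    print(f"{host}:57777 is {state}")
```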

Configuration Reference

All SP_ONCALL_* variables can be set in your .env file. See .env.example for the full list with comments.

| Variable | Default | Description |
| --- | --- | --- |
| `SP_ONCALL_MAX_RETRIES` | `3` | Max execution retries per device investigation. Also overridable from LangGraph Studio. |
| `SP_ONCALL_FAST_MODEL` | `openai/gpt-4o-mini` | Model used for structured output parsing; faster and cheaper than the main reasoning model. |
| `SP_ONCALL_LOG_LEVEL` | `info` | Log level for sp_oncall modules (`debug` \| `info` \| `warning` \| `error`). |
| `SP_ONCALL_LANGCHAIN_DEBUG` | `false` | Enable verbose LangChain debug tracing. |
| `SP_ONCALL_MODULE_LEVELS` | (none) | Per-module log overrides (e.g. `sp_oncall.nodes=debug,langgraph=error`). Run `make logger-names` to list modules. |
| `SP_ONCALL_LOG_FILE` | (none) | Write logs to a file in addition to stdout. |
| `SP_ONCALL_EXTERNAL_SUPPRESSION_MODE` | `langgraph` | Suppress noisy external library logs (`langgraph` \| `none`). |
| `OPENROUTER_API_KEY` | (none) | Required only when using `openrouter/*` models (e.g. `openrouter/anthropic/claude-sonnet-4`). |
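The SP_ONCALL_MODULE_LEVELS format is a comma-separated list of module=level pairs. As an illustration of how such a value could be interpreted (the project's own parser may behave differently, and parse_module_levels is a hypothetical name), consider:

```python
import logging

# Levels listed for SP_ONCALL_LOG_LEVEL in the table above.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def parse_module_levels(spec: str) -> dict[str, int]:
    """Parse 'module=level,module=level' into {module: logging level}."""
    overrides: dict[str, int] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        module, sep, level = item.partition("=")
        level = level.strip().lower()
        if not sep or not module.strip() or level not in LEVELS:
            raise ValueError(f"invalid override: {item!r}")
        overrides[module.strip()] = LEVELS[level]
    return overrides


if __name__ == "__main__":
    for name, level in parse_module_levels("sp_oncall.nodes=debug,langgraph=error").items():
        logging.getLogger(name).setLevel(level)
```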

AI model selection

In LangGraph Studio, click Manage Assistants to select the main reasoning model. Available models are defined in src/configuration.py under LLMModel and include OpenAI and OpenRouter options.

Investigation skills

Investigation strategies live in skills/ as Markdown files following the agentskills.io specification. Alert-triggered runs filter by event_type via src/util/skill_routing.py; manual queries use all available skills.
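The routing rule can be pictured with the sketch below. It is not the code in src/util/skill_routing.py: the Skill class, its event_types field, and the skill names are assumptions made for illustration, since this README does not say how a skill declares the events it applies to.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Skill:
    """Hypothetical in-memory view of one skill file."""
    name: str
    event_types: frozenset[str]


def select_skills(skills: Iterable[Skill], event_type: Optional[str] = None) -> list[Skill]:
    """Manual queries (no event type) get every skill; alerts get matching ones."""
    if event_type is None:
        return list(skills)
    return [skill for skill in skills if event_type in skill.event_types]


if __name__ == "__main__":
    catalog = [
        Skill("bgp-troubleshooting", frozenset({"bgp_down"})),
        Skill("interface-health", frozenset({"interface_down", "interface_flapping"})),
    ]
    print([s.name for s in select_skills(catalog)])              # manual query
    print([s.name for s in select_skills(catalog, "bgp_down")])  # alert-triggered
```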

For detailed logging configuration, see src/logging/README.md.

For domain terminology (Alert, Investigation, Device Profile, Thread, etc.), see CONTEXT.md.

Optional Observability Integration

SP Oncall works on its own with manual queries in LangGraph Studio. If you want to experiment with an observability-driven workflow, use it together with xrd-observability-stack, which provides Grafana, Alertmanager, Prometheus, and the external webhook-receiver service that forwards alerts into SP Oncall.

Alert-Driven Workflow

Alert fires — the observability stack detects a network event and routes it to the external webhook-receiver service.
Webhook receiver — transforms the Grafana payload into a NetworkAlert and calls POST /runs on the LangGraph API.
Investigation runs in the background, executing the full graph: validator → context phase → primary phase → RCA → report.
Open LangGraph Studio and join the thread by its ID to watch the investigation progress in real time.
Ask follow-up questions in the same thread — agents have full access to the investigation state and can dive deeper.
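The webhook-receiver step can be sketched as follows. The real receiver lives in xrd-observability-stack and is not shown here, so everything specific is an assumption: the NetworkAlert fields, the "device" label, the input shape, the "agent" assistant ID, and the base URL are placeholders. The sketch only builds the HTTP request and does not send it.

```python
import json
from dataclasses import asdict, dataclass
from urllib import request


@dataclass
class NetworkAlert:
    """Hypothetical alert model; the project's NetworkAlert may differ."""
    event_type: str
    device: str
    summary: str


def alerts_from_grafana(payload: dict) -> list[NetworkAlert]:
    """Convert firing alerts from a Grafana-style webhook payload."""
    alerts = []
    for item in payload.get("alerts", []):
        if item.get("status") != "firing":
            continue
        labels = item.get("labels", {})
        annotations = item.get("annotations", {})
        alerts.append(NetworkAlert(
            event_type=labels.get("alertname", "unknown"),
            device=labels.get("device", "unknown"),  # assumed label name
            summary=annotations.get("summary", ""),
        ))
    return alerts


def build_run_request(alert: NetworkAlert,
                      base_url: str = "http://localhost:2024",
                      assistant_id: str = "agent") -> request.Request:
    """Build (but do not send) a POST /runs request for one alert."""
    body = {"assistant_id": assistant_id, "input": {"alert": asdict(alert)}}
    return request.Request(
        f"{base_url}/runs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the run; keeping request construction separate from sending makes the transformation easy to test without a running server.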

Testing with Sample Alerts

The scripts/test_alert.sh helper sends sample Grafana-style alerts to a webhook endpoint (useful for testing with xrd-observability-stack). It is experimental and not required for manual usage.

# Show the curl commands without sending (dry run)
bash scripts/test_alert.sh --dry-run

# Send a specific alert type
bash scripts/test_alert.sh interface_down
bash scripts/test_alert.sh bgp_down
bash scripts/test_alert.sh isis_down
bash scripts/test_alert.sh topology_degraded
bash scripts/test_alert.sh interface_flapping
bash scripts/test_alert.sh interface_errors

By default the script posts to http://localhost:8080/alert. Override with WEBHOOK_URL= if your receiver is running elsewhere. The receiver is not part of this repository — start it from xrd-observability-stack.

Getting Help

Issues: Check the GitHub issues page
Questions: Open a new issue with your question
Contributing: This is a proof-of-concept experiment. Contributions and forks welcome.

Learn More

gNMI: gRPC Network Management Interface
LangGraph: LangChain's workflow framework
XRd Observability Stack: Companion project for Grafana, Alertmanager, Prometheus, and webhook integration
DevNet Sandbox: Cisco's free network simulation environment

Testing

# If you cloned the repo
# Shut down an interface for a quick test
ANSIBLE_HOST_KEY_CHECKING=False \
uvx --from "ansible-core==2.19.2" --with "paramiko,ansible" \
ansible-playbook ansible-helper/xrd_apply_config.yaml -i ansible-helper/hosts