Production Monitoring & Security Automation Runbook

Purpose

This runbook describes how to operate, monitor, and respond to events generated by the production security automation stack. It covers the full lifecycle of an alert — from initial detection through investigation, response, and escalation — for the following systems:

n8n — Workflow orchestration engine that drives all automation logic and alert routing.
AdGuard Home (Primary — Pi 1 and Secondary — Pi 2) — DNS resolvers that provide query logging, filtering, and abuse detection across the network.
Fail2Ban — Intrusion prevention system that monitors authentication logs and automatically blocks brute-force source IPs.
Uptime Kuma — Service availability monitor that tracks the health of DNS servers, containers, and other critical endpoints.
Slack — Alerting and notification delivery channel for all actionable events.
Omada Controller — Network device management platform monitoring access points, gateways, and switches.

This document is written so that any on-call engineer can safely assess and respond to alerts without requiring deep knowledge of the underlying systems. Every response procedure is self-contained and includes explicit pass/fail criteria.

System Overview

What This System Does

Monitors DNS query behavior on both AdGuard servers (Pi 1 = Primary, Pi 2 = Secondary) and detects patterns indicative of abuse or attack using configurable query heuristics.
Automatically blocks malicious source IPs in AdGuard Home when the auto-block feature is enabled, without requiring manual intervention for confirmed threats.
Monitors the availability and uptime of both DNS servers and reports status to Uptime Kuma on every automation cycle.
Pushes heartbeat signals to Uptime Kuma every minute, confirming that the automation pipeline itself is running correctly.
Receives and processes Fail2Ban ban and unban events from multiple hosts, enriching them with IP reputation data before forwarding alerts to Slack.
Receives Omada Controller device events — including access point, gateway, and switch state changes — and delivers actionable alerts.
Sends formatted, actionable alerts to Slack for all events that require human awareness or intervention.
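
As an illustration of the heartbeat push described above, the sketch below builds the URL that a push-type Uptime Kuma monitor would be called with. It assumes the monitors are push monitors accepting the status, msg, and ping query parameters; the host name and token in the usage comment are placeholders, and the function name build_push_url is hypothetical rather than taken from the production workflow.

```python
from typing import Optional
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_push_url(push_url: str, up: bool, msg: str,
                   ping_ms: Optional[float] = None) -> str:
    """Return push_url with its query string replaced by the current status.

    push_url is expected to be the value of a variable such as
    KUMA_PI1_UPTIME_URL. Any query string already present is discarded.
    """
    parts = urlsplit(push_url)
    params = {
        "status": "up" if up else "down",
        "msg": msg,
        # An empty ping value is sent when no latency was measured.
        "ping": "" if ping_ms is None else str(round(ping_ms)),
    }
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), ""))


# Hypothetical usage with a placeholder host and token:
# build_push_url("https://kuma.example.invalid/api/push/abc123", True, "DNS NORMAL")
```

Rebuilding the query string on every cycle, instead of reusing whatever was pasted into the environment variable, keeps the reported status and message in step with what the workflow actually observed.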

What This System Does NOT Do

Understanding the boundaries of automation is as important as understanding what it handles. This system deliberately does not:

Permanently blacklist IPs without human review. Automated blocks are temporary and scoped. Permanent additions to any blocklist require explicit operator action.
Modify host-level firewall rules. All automated blocking operates at the DNS layer (AdGuard Home) only. UFW and iptables rules are never modified by automation.
Auto-restart servers or services. Availability alerts require a human to investigate root cause before any restart is performed. Automated restarts are intentionally out of scope to prevent masking underlying issues.

Normal Operation (Healthy State)

Expected Behavior

Under normal operating conditions, the following should be true at all times:

The n8n automation cron runs every minute without interruption. A gap in Uptime Kuma heartbeat signals is the primary indicator that the automation pipeline has stalled.
Slack is quiet during normal periods. Frequent alert messages during a calm period may indicate a misconfigured detection threshold or a false-positive loop.
Uptime Kuma displays the following statuses for all monitored endpoints:
- Pi 1 Uptime: UP
- Pi 2 Uptime: UP
- DNS Status: NORMAL
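
Because a gap in heartbeats is the primary stall indicator, it can help to make "gap" concrete. The following is a minimal sketch that flags gaps in a list of heartbeat timestamps, assuming a 60-second schedule. The 30-second tolerance and the function name heartbeat_gaps are assumptions for illustration, not values taken from the production configuration.

```python
def heartbeat_gaps(timestamps, interval_s=60, tolerance_s=30):
    """Return (previous, next) timestamp pairs separated by more than
    interval_s + tolerance_s seconds.

    timestamps: iterable of Unix timestamps (seconds) at which a
    heartbeat was received, in any order.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > interval_s + tolerance_s:
            gaps.append((prev, cur))
    return gaps
```

A non-empty result means at least one automation cycle was missed between the two timestamps in each pair.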

Normal Slack Messages (No Action Required)

The following Slack messages are informational and do not require any operator response:

✅ DNS NORMAL — Baseline confirmation that DNS query traffic is within expected parameters.
✅ DNS OK / RECOVERED — A previously elevated DNS condition has resolved and traffic has returned to normal.
✅ Fail2Ban UNBANNED — A previously blocked IP has been released from the Fail2Ban jail after its ban duration expired.
ℹ️ Omada informational events — Routine device state changes (e.g., AP reconnected, controller heartbeat) that do not indicate a fault condition.

If these messages appear at an unusually high frequency, investigate whether an underlying condition is causing repeated state transitions.

Alert Types and Response Procedures

🚨 POSSIBLE DNS ATTACK

What this means

A single client IP is responsible for an abnormally high proportion of recent DNS queries on one of the AdGuard servers. This pattern is consistent with DNS amplification abuse, reconnaissance scanning, or a misconfigured device generating excessive query volume. The alert fires when either of the following thresholds is reached:

The source IP accounts for 80% or more of queries in the current sample window, regardless of total volume, or
The source IP has generated 400 or more queries in the current sample window, regardless of its percentage share.
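
The two thresholds above can be expressed as a short check. This is a sketch under stated assumptions, not the production n8n logic: it assumes the sample window is available as a list with one client IP per query, and the names find_suspect and whitelist are hypothetical.

```python
from collections import Counter

SHARE_THRESHOLD = 0.80   # 80% or more of queries in the window
COUNT_THRESHOLD = 400    # 400 or more queries in the window


def find_suspect(client_ips, whitelist=frozenset()):
    """Return details of the busiest non-whitelisted client that reaches
    either threshold, or None if no client does.

    client_ips: one entry per DNS query in the current sample window.
    """
    counts = Counter(client_ips)
    total = sum(counts.values())
    if total == 0:
        return None
    # most_common() yields clients from busiest to quietest.
    for ip, queries in counts.most_common():
        if ip in whitelist:
            continue
        share = queries / total
        if share >= SHARE_THRESHOLD or queries >= COUNT_THRESHOLD:
            return {"ip": ip, "queries": queries, "share": share}
    return None
```

Note that the two conditions are independent: a client with 8 of 10 queries triggers on share alone, and a client with 400 of 1,000 queries triggers on volume alone.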

Automatic actions already taken by the time you receive this alert

If auto-block is enabled, the source IP has already been added to AdGuard Home’s disallowed clients list. DNS queries from this IP are being dropped.
IP reputation data from IPinfo (organization, ASN, country, and abuse classification) has been retrieved and attached to the Slack alert for your review.

Required response — follow these steps in order

Open the Slack alert and record the attacker IP, client name (if resolved), organization, and ASN.
Log into the affected AdGuard server (Pi 1 or Pi 2 as identified in the alert).
Open the Query Log and filter by the source IP. Confirm that the traffic pattern — query volume, query types, and target domains — matches what the alert described.
Determine whether the source is a legitimate client or a malicious actor:
- If the IP belongs to a known legitimate client (e.g., an internal device that was misconfigured or restarted in a loop): remove the IP from the AdGuard disallowed clients list, then add it to the DNS whitelist in n8n to prevent future false-positive blocks.
- If the IP is external, unknown, or confirmed malicious: no further action is required. The auto-block has already handled it. Monitor subsequent alerts to determine whether the attack is continuing from new IPs.

Escalation

If attacks are arriving in rapid succession from multiple distinct IP addresses — particularly from different ASNs or geographic regions — this may indicate a coordinated distributed attack rather than a single abusive client. In this case, notify the network and security team immediately rather than attempting to manage the response alone.

✅ DNS OK / RECOVERED

What this means

DNS query traffic on the affected server has returned to normal levels following a previously elevated condition. The detection system has cleared the alert state automatically.

Required action

None. This message is informational. If a POSSIBLE DNS ATTACK alert preceded this message, the condition has cleared. No follow-up is required unless the attack resumes.

🔴 UPTIME DOWN

What this means

A monitored DNS server (Pi 1 or Pi 2) is either completely unreachable over the network or is returning an unexpected HTTP status code from its health endpoint. This alert fires when Uptime Kuma fails to receive a successful response within the configured timeout period.

Required response — follow these steps in order

Open Uptime Kuma and confirm the alert is active and not a transient check failure. If Uptime Kuma shows the host recovering on its own within one to two minutes, treat it as a brief connectivity blip and monitor for recurrence.
Attempt to reach the affected host:
- Ping the host’s WireGuard IP (e.g., ping 10.8.0.2) to determine whether the tunnel is up and the host is reachable at the network layer.
- Attempt HTTPS access to the AdGuard Home admin interface to determine whether the service itself is running.
If the host is completely unreachable (ping fails):
- Check physical power to the Raspberry Pi.
- Check the network uplink — verify that the LAN switch port and any intermediate network equipment are functioning.
- Check whether the WireGuard tunnel has dropped by running sudo wg on the server side and checking for a recent handshake timestamp for the affected peer.
If the host is reachable by ping but the service is not responding, SSH into the Pi over WireGuard and review system logs: sudo journalctl -u adguardhome --since "10 minutes ago".
Restart the AdGuard Home service only after reviewing logs and confirming there is no data corruption or configuration issue that would cause an immediate re-failure: sudo systemctl restart adguardhome.
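
For the WireGuard handshake check in the steps above, the timestamp comparison can be scripted. The sketch below parses the output of wg show with the latest-handshakes argument for a single interface (for example, sudo wg show wg0 latest-handshakes, where wg0 is an assumed interface name), which prints one peer public key and one Unix timestamp per line. The 180-second limit for a "recent" handshake and the function name stale_peers are assumptions for illustration.

```python
def stale_peers(latest_handshakes_output, now, max_age_s=180):
    """Return public keys of peers whose last handshake is missing or
    older than max_age_s seconds.

    latest_handshakes_output: text with "<public-key> <unix-timestamp>"
    on each line. A timestamp of 0 means the peer has never completed
    a handshake.
    now: current Unix time in seconds.
    """
    stale = []
    for line in latest_handshakes_output.splitlines():
        fields = line.split()
        # Skip blank or unexpected lines rather than failing mid-incident.
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        peer, last_handshake = fields[0], int(fields[1])
        if last_handshake == 0 or now - last_handshake > max_age_s:
            stale.append(peer)
    return stale
```

If the affected Pi's public key appears in the result, the tunnel is the likely fault and the AdGuard Home service itself may be healthy.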

Escalation

If the host cannot be recovered remotely and downtime has exceeded the SLA threshold defined for DNS availability, notify management and the network team. If both Pi 1 and Pi 2 are down simultaneously, DNS resolution for the entire network may be affected and this becomes a critical priority incident.

🚫 Fail2Ban BANNED

What this means

Fail2Ban has detected and blocked a source IP that exceeded the configured authentication failure threshold for a monitored service (typically SSH, nginx, or a web application login endpoint). The source IP has already been added to the relevant jail and is currently blocked at the service level.

Automatic actions already taken by the time you receive this alert

The source IP is already blocked by Fail2Ban for the duration configured in the relevant jail.
Geographic and IP reputation data has been retrieved and attached to the Slack alert.

Required response — follow these steps in order

Review the IP address and reputation data in the Slack alert. Note the jail name — this tells you which service was targeted (e.g., sshd indicates an SSH brute-force attempt, nginx-http-auth indicates a web authentication attack).
Determine whether the blocked IP is a known internal or trusted address:
- If the IP belongs to an internal device, administrator, or trusted partner: manually unban the IP using sudo fail2ban-client set <jail_name> unbanip <ip_address>. Review the Fail2Ban configuration to determine whether the threshold or whitelist needs adjustment to prevent recurrence.
- If the IP is external or confirmed malicious: no action is required. Fail2Ban has handled it. The ban will expire automatically after the configured duration.
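
The trusted-address decision above can be checked mechanically before unbanning anything. This is a minimal sketch, assuming trusted ranges are kept as a list of CIDR networks. The two networks shown are placeholders and must be replaced with the real internal and partner ranges; the function names are hypothetical.

```python
import ipaddress
import shlex

# Placeholder ranges for illustration only; replace with real trusted networks.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("10.8.0.0/24"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_trusted(ip):
    """Return True if ip falls inside any trusted network."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    return any(addr in net for net in TRUSTED_NETWORKS)


def unban_command(jail, ip):
    """Build the manual unban command for a given jail and IP address."""
    ipaddress.ip_address(ip)  # validate before building a sudo command
    return shlex.join(["sudo", "fail2ban-client", "set", jail, "unbanip", ip])
```

The helper only builds the command string. Running it remains a deliberate operator action, consistent with the manual unban step above.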

🚨 Omada Device DOWN

What this means

The Omada Controller has detected that a managed network device — an access point, gateway, or switch — has become disconnected or stopped responding to controller polls. This may indicate a power failure, a network uplink failure, or a device firmware or software fault.

Required response — follow these steps in order

Identify the specific device and site from the Slack alert. The alert should include the device name, MAC address, and site location.
Log into the Omada Controller and navigate to the device’s status page to confirm the disconnection and review any available last-seen timestamps or error messages.
Physically verify power to the device. For PoE-powered access points, check whether the PoE switch port is delivering power (many managed switches display per-port PoE status in their web interface).
Verify the network uplink — check whether the switch port the device connects to is active, and whether intermediate switches between the device and the controller are functioning.
If multiple devices at the same site go down simultaneously, suspect an upstream outage rather than individual device failures. Check the site’s core switch and internet gateway before investigating individual access points.
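
The "multiple devices at the same site" rule in the last step can be sketched as a simple grouping. The example assumes each down event has already been reduced to a dictionary with site and mac keys and that the list covers only the recent time window of interest. Those field names, the two-device minimum, and the function name are assumptions, not the actual Omada payload format.

```python
from collections import defaultdict


def sites_with_probable_upstream_outage(down_events, min_devices=2):
    """Return the sorted names of sites where at least min_devices
    distinct devices are currently reported down.

    down_events: iterable of dicts with "site" and "mac" keys.
    """
    devices_by_site = defaultdict(set)
    for event in down_events:
        # A set ignores repeated alerts for the same device.
        devices_by_site[event["site"]].add(event["mac"])
    return sorted(site for site, macs in devices_by_site.items()
                  if len(macs) >= min_devices)
```

Any site returned here is a candidate for checking the core switch and gateway first, before individual access points.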

Environment and Configuration Reference

Required Environment Variables (n8n)

The following environment variables must be set in the n8n instance for the automation workflows to function correctly. Missing or incorrect values will cause silent failures in alerting or heartbeat delivery:

F2B_TOKEN — Authentication token used to receive Fail2Ban webhook events from monitored hosts.
IPINFO_TOKEN — API key for the IPinfo service, used to enrich alerts with geographic and reputation data for blocked IPs.
KUMA_PI1_UPTIME_URL — Uptime Kuma push URL for Pi 1 availability heartbeats.
KUMA_PI1_DNS_URL — Uptime Kuma push URL for Pi 1 DNS status heartbeats.
KUMA_PI2_UPTIME_URL — Uptime Kuma push URL for Pi 2 availability heartbeats.
KUMA_PI2_DNS_URL — Uptime Kuma push URL for Pi 2 DNS status heartbeats.
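
Since missing values fail silently, a quick pre-flight check can surface them. The following sketch lists which of the variables above are unset or blank. It is an illustrative helper meant to run in the same environment as n8n, not part of the workflows themselves, and the function name missing_vars is hypothetical.

```python
import os

REQUIRED_VARS = (
    "F2B_TOKEN",
    "IPINFO_TOKEN",
    "KUMA_PI1_UPTIME_URL",
    "KUMA_PI1_DNS_URL",
    "KUMA_PI2_UPTIME_URL",
    "KUMA_PI2_DNS_URL",
)


def missing_vars(env=None):
    """Return the required variable names that are unset or blank.

    env: mapping to inspect; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

This check only confirms that values are present. It cannot tell whether a token or push URL is correct, so the post-change verification steps still apply.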

Webhook Endpoints

The following n8n webhook paths receive inbound events from external systems. These endpoints must remain reachable from the hosts and controllers that post to them:

/fail2ban-pi1 — Receives Fail2Ban ban and unban events from Pi 1.
/fail2ban-pi2 — Receives Fail2Ban ban and unban events from Pi 2.
/OmadaController — Receives device state change events from the primary Omada Controller.
/JoeOmadaTPlink — Receives device state change events from the secondary Omada/TP-Link controller.

Maintenance and Safe Change Procedures

Before Making Changes

The following steps must be completed before modifying any active automation workflow or configuration:

Disable auto-block in the workflow configuration before testing detection logic changes. This prevents test traffic from inadvertently blocking legitimate IPs during the testing window.
Clone the workflow rather than editing the production version directly. Test all changes in the cloned workflow and confirm correct behavior before applying changes to the active production workflow.
Verify Slack output formatting on the cloned workflow by triggering it manually and reviewing the alert structure in the test Slack channel before promoting to production.

After Making Changes

After applying any change to a production workflow, complete the following verification steps before considering the change stable:

Manually trigger the workflow at least once and review the execution log in n8n for any errors or unexpected node failures.
Confirm no duplicate Slack alerts are being generated. Duplicate alerts indicate a workflow branching or deduplication logic issue that must be resolved before the change is left in production.
Confirm Uptime Kuma heartbeats are still flowing for all monitored endpoints. A change that inadvertently breaks the heartbeat delivery path will cause false-positive downtime alerts within minutes.

Break-Glass Emergency Procedure

Use this procedure if the automation system is behaving incorrectly — for example, blocking legitimate IPs in a loop, sending repeated duplicate alerts, or making unexpected changes to AdGuard configuration. The goal is to halt all automated actions immediately, restore a known-good state, and preserve enough information to diagnose the root cause.

Disable the affected n8n workflow immediately. In the n8n interface, toggle the workflow to inactive. This stops all scheduled and triggered executions instantly. Do not attempt to debug a running workflow — disable first, investigate second.
Review and clean up the AdGuard block list. Log into both Pi 1 and Pi 2 AdGuard Home instances and review the disallowed clients list. Remove any IPs that were incorrectly added by the runaway automation. Cross-reference against known-good IPs before removing entries.
Notify the security and network team. Provide the time the issue started, a description of what the automation was doing incorrectly, and a list of any IPs that were blocked and have since been removed.
Document the incident. Record the timeline, root cause (once identified), impact, and remediation steps taken. This documentation is required before the workflow is re-enabled. Do not re-enable the workflow until the root cause is understood and a fix has been tested in a cloned workflow.
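
For the block list cleanup step, the cross-reference against known-good IPs can be done as a set comparison before anything is removed. This is a minimal sketch that assumes both lists are plain IP address strings and matches them exactly. Entries such as CIDR ranges or client names are left untouched for manual review, and the function name is hypothetical.

```python
def split_disallowed(disallowed, known_good):
    """Split the disallowed clients list into entries to remove and
    entries to keep.

    disallowed: entries currently in the AdGuard disallowed clients list.
    known_good: IP addresses known to belong to legitimate clients.
    Returns (to_remove, to_keep), each preserving the original order.
    """
    good = set(known_good)
    to_remove = [entry for entry in disallowed if entry in good]
    to_keep = [entry for entry in disallowed if entry not in good]
    return to_remove, to_keep
```

Saving both returned lists also gives the incident record the list of IPs that were blocked and subsequently removed.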

Ownership and Contacts

System owner: IT / Network Team — responsible for ongoing maintenance, configuration changes, and SLA compliance.
Primary contact: IT Manager — escalation point for incidents that cannot be resolved by on-call response procedures.
Slack alert channel: Monitoring / Security Alerts — all automated alerts are delivered here. On-call engineers should ensure they have notifications enabled for this channel during their rotation.
