Production Monitoring & Security Automation Runbook
Purpose
This runbook describes how to operate, monitor, and respond to events generated by the company’s production automation stack:
- n8n (orchestration)
- AdGuard DNS (Primary & Secondary Raspberry Pi)
- Fail2Ban
- Uptime Kuma
- Slack (alerting)
- Omada Controller (network devices)
It is written so that any on-call engineer can safely respond to alerts without deep system knowledge.
System Overview
What this system does
- Monitors DNS behavior on two AdGuard servers (Pi1 = Primary, Pi2 = Secondary)
- Detects possible DNS abuse / attacks using query heuristics
- Automatically blocks malicious IPs in AdGuard (when enabled)
- Monitors uptime of both DNS servers
- Pushes health heartbeats to Uptime Kuma
- Receives Fail2Ban ban/unban events from multiple hosts
- Receives Omada controller events (AP, gateway, switch up/down)
- Sends actionable alerts to Slack
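The heartbeat push mentioned in the list above can be sketched as follows. This is a minimal illustration, assuming the Kuma monitors are push-type monitors whose URL accepts `status`, `msg`, and `ping` query parameters; the function name `heartbeat_url` and the example host are hypothetical and not taken from the production workflow.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def heartbeat_url(push_url, up, msg, ping_ms=None):
    """Build a push URL carrying the current status.

    Any query string already present on `push_url` is replaced, so a URL
    copied from the Kuma UI (which may end in ?status=up&msg=OK&ping=)
    can be passed in unchanged.
    """
    parts = urlsplit(push_url)
    query = urlencode({
        "status": "up" if up else "down",
        "msg": msg,
        "ping": "" if ping_ms is None else str(ping_ms),
    })
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    # Placeholder URL; in n8n the real value would come from a variable
    # such as KUMA_PI1_DNS_URL.
    print(heartbeat_url("https://kuma.example/api/push/abc123", True, "DNS NORMAL"))
```

Keeping URL construction separate from the HTTP request makes it easy to check the exact heartbeat a change would send before any request is made.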
What it does NOT do
- It does not permanently blacklist IPs without review
- It does not modify firewall rules (DNS-layer only)
- It does not auto-restart servers
Normal Operation (Healthy State)
Expected behavior
- Cron runs every minute
- Slack is quiet most of the time
- Uptime Kuma shows:
  - Pi1 Uptime: UP
  - Pi2 Uptime: UP
  - DNS Status: NORMAL
Normal Slack messages
- ✅ DNS NORMAL (baseline)
- ✅ DNS OK / RECOVERED
- ✅ Fail2Ban UNBANNED
- ℹ️ Omada informational events
No action is required in these cases.
Alert Types & Response Actions
🚨 POSSIBLE DNS ATTACK
Meaning
- One client is responsible for an abnormally high percentage of DNS queries
- Triggered when a single client accounts for either:
  - ≥ 80% of recent queries, OR
  - ≥ 400 queries in the sample window
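The trigger condition above can be sketched in Python. This is an illustration only, not the n8n workflow's actual code: the function name `find_suspect_client` and the input shape (one client IP per logged query) are assumptions, while the two thresholds come from this runbook.

```python
from collections import Counter

SHARE_THRESHOLD = 0.80  # runbook: >= 80% of recent queries
COUNT_THRESHOLD = 400   # runbook: >= 400 queries in the sample window


def find_suspect_client(client_ips):
    """Return (ip, count, share) for the busiest client if it trips
    either threshold, otherwise None.

    `client_ips` holds one entry per query in the sample window.
    """
    if not client_ips:
        return None
    ip, count = Counter(client_ips).most_common(1)[0]
    share = count / len(client_ips)
    if share >= SHARE_THRESHOLD or count >= COUNT_THRESHOLD:
        return ip, count, share
    return None


if __name__ == "__main__":
    sample = ["10.0.0.5"] * 9 + ["10.0.0.6"]
    print(find_suspect_client(sample))
```

Note that this sketch applies no minimum sample size, so a very small window dominated by one client would also trip the 80% rule. Whether the production workflow guards against that is not stated here.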
Automatic actions
- The offending IP may already be blocked in AdGuard (only if auto-block is enabled)
- IP reputation (IPinfo) is attached to the alert
Required response (step-by-step)
- Open the Slack alert
- Review:
  - Attacker IP
  - Client name (if known)
  - Organization / ASN
- Log in to the affected AdGuard server
- Open Query Log
- Confirm traffic pattern matches alert
- If the client is legitimate:
  - Remove the IP from disallowed_clients in AdGuard
  - Add the client to the DNS whitelist in n8n
- If the client is malicious:
  - No further action needed once you confirm the auto-block was applied
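Removing a legitimate client from disallowed_clients can also be scripted. The sketch below assumes the servers run AdGuard Home and that its access list is read with `GET /control/access/list` and written back with `POST /control/access/set`; verify both against the API documentation for the installed version. The helper name `without_blocked_client` is hypothetical, and the function only prepares the payload without contacting any server.

```python
ACCESS_KEYS = ("allowed_clients", "disallowed_clients", "blocked_hosts")


def without_blocked_client(access, ip):
    """Return a copy of an access-list dict with `ip` removed from
    disallowed_clients. The input dict is left unmodified, and missing
    or null lists are normalised to empty lists."""
    updated = {key: list(access.get(key) or []) for key in ACCESS_KEYS}
    updated["disallowed_clients"] = [
        client for client in updated["disallowed_clients"] if client != ip
    ]
    return updated


if __name__ == "__main__":
    current = {
        "allowed_clients": [],
        "disallowed_clients": ["192.168.1.50", "192.168.1.51"],
        "blocked_hosts": ["version.bind"],
    }
    # The returned dict is what would be sent back as the JSON body.
    print(without_blocked_client(current, "192.168.1.50"))
```

Sending the full access list back, rather than only the changed field, avoids accidentally clearing the other two lists if the endpoint replaces the whole configuration.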
Escalation
- Repeated attacks from different IPs → notify network/security team
✅ DNS OK / RECOVERED
Meaning
- DNS traffic has returned to normal
Action
- None required
🔴 / 🚨 UPTIME DOWN
Meaning
- A DNS server is unreachable or is returning a non-success HTTP status
Response steps
- Check Uptime Kuma for confirmation
- Attempt to reach host:
  - Ping
  - HTTPS access
- If unreachable:
  - Check power
  - Check network connectivity
- Review system logs if accessible
- Restart service/server if required
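The HTTPS reachability check in the steps above can be reproduced from a workstation with a short Python sketch. Which status codes count as "bad" is not defined in this runbook; the sketch assumes 2xx and 3xx mean UP and everything else means DOWN. The function names and the example URL are hypothetical.

```python
import urllib.error
import urllib.request


def classify_status(status_code):
    """Map an HTTP status code to "UP" or "DOWN" (assumption: 2xx/3xx = UP)."""
    return "UP" if 200 <= status_code < 400 else "DOWN"


def check_https(url, timeout=5.0):
    """Probe `url` once and return (state, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_status(response.status), f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return classify_status(exc.code), f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, or TLS error.
        return "DOWN", f"unreachable: {exc}"


if __name__ == "__main__":
    # Placeholder address; substitute the real Pi1 or Pi2 admin URL.
    print(check_https("https://pi1.example.internal/"))
```

A host using a self-signed certificate will be reported as DOWN by this sketch because certificate verification fails, so read the detail string before concluding the server itself is offline.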
Escalation
- If downtime exceeds the SLA threshold, notify management
🚫 Fail2Ban BANNED
Meaning
- Fail2Ban blocked an IP due to repeated authentication failures
Automatic actions
- IP already blocked at service level
- Geo/IP data added automatically
Response steps
- Review IP reputation in Slack
- Confirm jail name (sshd, nginx, etc.)
- If internal or known IP:
  - Manually unban
  - Adjust Fail2Ban rules if needed
- If external/malicious:
  - No action required
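The internal-versus-external decision and the manual unban can be sketched as below. The unban uses the standard `fail2ban-client set <jail> unbanip <ip>` command; the helper names are hypothetical, and treating private, loopback, and link-local addresses as "internal" is an assumption. Known external IPs (for example an office's public address) still need a human decision.

```python
import ipaddress


def is_internal(ip):
    """True for private, loopback, or link-local addresses."""
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


def unban_command(jail, ip):
    """Build the argv list for unbanning `ip` from `jail`.

    Raises ValueError if `ip` is not a valid IPv4/IPv6 address, so a
    malformed value copied from an alert never reaches the command line.
    """
    addr = ipaddress.ip_address(ip)
    return ["fail2ban-client", "set", jail, "unbanip", str(addr)]


if __name__ == "__main__":
    ip = "192.168.1.20"
    if is_internal(ip):
        # Run on the host that issued the ban, e.g. via subprocess.run(cmd, check=True).
        print(" ".join(unban_command("sshd", ip)))
```

Building the command as an argument list rather than a shell string avoids quoting problems if the jail name or IP contains unexpected characters.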
🚨 Omada Device DOWN
Meaning
- AP, gateway, or switch disconnected
Response steps
- Identify device and site in Slack alert
- Check Omada Controller status
- Verify power and uplink
- If multiple devices affected:
  - Suspect upstream outage
Environment & Configuration
Required environment variables (n8n)
- F2B_TOKEN
- IPINFO_TOKEN
- KUMA_PI1_UPTIME_URL
- KUMA_PI1_DNS_URL
- KUMA_PI2_UPTIME_URL
- KUMA_PI2_DNS_URL
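A quick pre-flight check for these variables might look like the sketch below. It assumes the required names are F2B_TOKEN, IPINFO_TOKEN, KUMA_PI1_UPTIME_URL, KUMA_PI1_DNS_URL, KUMA_PI2_UPTIME_URL, and KUMA_PI2_DNS_URL, and that it is run in the same environment as n8n; the function name `missing_env_vars` is hypothetical.

```python
import os

REQUIRED_VARS = (
    "F2B_TOKEN",
    "IPINFO_TOKEN",
    "KUMA_PI1_UPTIME_URL",
    "KUMA_PI1_DNS_URL",
    "KUMA_PI2_UPTIME_URL",
    "KUMA_PI2_DNS_URL",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the process environment; pass a dict to check
    a candidate configuration without touching the real one.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required variables are set")
```

The check reports names only and never prints values, so its output is safe to paste into Slack or a ticket.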
Webhook endpoints
- /fail2ban-pi1
- /fail2ban-pi2
- /OmadaController/JoeOmadaTPlink
Maintenance & Safe Changes
Before making changes
- Disable auto-block if testing
- Clone workflow for testing
- Verify Slack output formatting
After changes
- Manually trigger workflow
- Confirm:
  - No duplicate Slack alerts
  - Kuma heartbeats still flow
Break-Glass (Emergency)
If automation behaves incorrectly:
- Disable the n8n workflow
- Remove IPs from AdGuard block list
- Notify security/network team
- Document incident
Ownership
- System owner: IT / Network Team
- Primary contact: IT Manager
- Slack channel: Monitoring / Security Alerts