Empowering Automated Infrastructure Resilience

From Monitoring to Remediation: Building AximWatch for Agency Uptime Operations

A multi-tenant Flask SaaS that goes beyond “is it up?”—validating real page health with keyword checks, tracking uptime and SSL expirations, alerting teams and client stakeholders, capturing server logs, and triggering controlled AWS EC2 recovery when downtime persists.

The Problem: Uptime Tools Don’t Match Agency Reality

Agency monitoring isn’t just “ping the homepage and hope for the best.” With dozens of client sites across different stacks, hosts, and configurations, two failure modes show up quickly: false positives (alerts for brief blips) and false confidence (HTTP 200 responses that still represent a broken experience, such as WAF blocks, cached error templates, empty renders, or a site that is “up” while conversions are dead). At the same time, most tools stop at detection. They don’t preserve the operational context your team needs: who to notify, what changed, how long the site has been down, what the server logs say, and what action should happen next.

AximWatch was built to solve that workflow end-to-end: treat availability as application-level health, maintain a clean incident timeline, route alerts to both internal staff and client stakeholders, and connect downtime to practical response—including diagnostics and controlled infrastructure remediation.

Key Features of AximWatch

The features below cover the full workflow: detection, incident history, alerting, diagnostics, and remediation.

01

Application-level uptime checks (not just ping/HTTP):

Verifies real page health using keyword/content validation to catch soft failures (200 OK but broken content).

02

Multi-tenant agency model:

Teams, members, roles, and team-owned websites so the platform matches agency/MSP operations.

03

Incident timeline + uptime history:

Stores Up/Down segments with timestamps and durations, plus “ongoing downtime duration” for active incidents.

04

Maintenance mode controls:

Prevents intentional downtime from triggering noisy alerts or remediation.

05

Stakeholder alerting:

Notifies both platform users and non-user contacts (client stakeholders) with per-recipient preferences; includes test notification tooling.

06

SSL expiration monitoring:

Daily checks that record certificate expiry so renewals are proactive instead of reactive.

07

Diagnostics on demand:

Pulls recent access/error log tails over SSH to speed up triage.

08

Automated remediation:

If downtime persists past a threshold, triggers a controlled AWS EC2 reboot workflow tied to the monitored site/server mapping.

09

AWS integrations:

Uses AWS services (SNS/SES/EC2/CloudWatch) to support alert delivery, metrics, and infrastructure actions.

10

Portfolio-friendly reporting views:

Dashboards for company/team visibility and per-website reporting with uptime percentages and event history.
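
To make the first feature concrete, here is a minimal sketch of a keyword-validated health check. It is an illustration, not the AximWatch source: the names CheckResult and evaluate_response are hypothetical, and it assumes the caller has already fetched the page (following redirects) and passes in the final status code and body.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one check (hypothetical type, for illustration only)."""
    is_up: bool
    reason: str


def evaluate_response(status_code: int, body: str, keyword: Optional[str]) -> CheckResult:
    """Classify a fetched page as up or down.

    Assumption: redirects were already followed, so only 2xx counts as healthy.
    """
    if not 200 <= status_code < 300:
        return CheckResult(False, f"unexpected status {status_code}")
    # A 200 response can still be a WAF block page or a cached error template,
    # so require the configured keyword to appear in the body (case-insensitive).
    if keyword and keyword.lower() not in body.lower():
        return CheckResult(False, f"keyword {keyword!r} missing from response body")
    return CheckResult(True, "ok")
```

With this shape, a 200 response whose body reads "Access denied by firewall" is reported as down when the configured keyword is "welcome", which is the soft-failure case a status-only check would miss.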

Technical Highlights

Advanced System Architecture

AximWatch is structured like an operational control plane for an agency portfolio: the Flask web application owns tenancy, monitoring policy, alert routing, and incident history, while monitored websites and their servers are modeled as managed assets with attached health rules. At the center is a relational domain model (Teams, TeamMembers, Websites, Uptime segments, LogRecords, and notification entities) designed specifically for incident-grade reporting. Instead of treating uptime as a simple boolean, the system records state transitions into time-segmented events, enabling accurate uptime percentages, clean start/end boundaries, and durable audit trails. This keeps the “business truth” of what happened (and when) in the database—not in transient logs.
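
The segment-based model described above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the actual schema: Segment, record_check, and uptime_percentage are hypothetical names, and a real deployment would persist segments in the relational database rather than in a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Segment:
    """One contiguous Up or Down period (hypothetical stand-in for a DB row)."""
    is_up: bool
    started_at: datetime
    ended_at: Optional[datetime] = None  # None means the segment is still open

    def duration(self, now: datetime) -> timedelta:
        # Open segments are measured up to "now", which is what gives an
        # "ongoing downtime duration" for an active incident.
        return (self.ended_at or now) - self.started_at


def record_check(segments: List[Segment], is_up: bool, at: datetime) -> bool:
    """Store a check result; return True only when the state changed."""
    current = segments[-1] if segments else None
    if current is not None and current.is_up == is_up:
        return False  # same state: nothing new to record
    if current is not None:
        current.ended_at = at  # close the prior segment
    segments.append(Segment(is_up=is_up, started_at=at))  # open the new one
    return True


def uptime_percentage(segments: List[Segment], now: datetime) -> float:
    """Share of recorded time spent in Up segments, as a percentage."""
    total = sum((s.duration(now) for s in segments), timedelta())
    if total == timedelta():
        return 100.0  # no elapsed time recorded yet
    up = sum((s.duration(now) for s in segments if s.is_up), timedelta())
    return 100.0 * (up / total)
```

Because only transitions create rows, repeated identical check results cost nothing in storage, and the start and end of every incident fall out of the segment boundaries directly.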

On the execution side, scheduled background tasks run the monitoring loop, SSL expiry checks, and supporting metric updates, while the integration layer connects monitoring results to real operational actions. Website checks include application-level validation via keyword/content matching to catch soft failures that HTTP-only checks miss. When state changes occur, AximWatch closes the prior uptime segment, opens a new segment, and routes notifications to both internal team members and external stakeholders. For sustained outages, it escalates from detection to controlled remediation by mapping a website to its infrastructure and triggering an AWS EC2 reboot only after a defined downtime window—favoring stability over aggressive automation. The result is a system that doesn’t just detect incidents; it preserves context, accelerates diagnosis, and supports consistent response at scale.
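
The escalation rule in the paragraph above (alert first, reboot only after a sustained outage, never during maintenance) can be expressed as a small pure function. This is a sketch under assumptions: the 15-minute threshold and the one-hour cooldown between reboots are illustrative values, the cooldown itself is an added safeguard that the text does not describe, and Action and decide_action are hypothetical names.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    ALERT_ONLY = "alert_only"
    REBOOT = "reboot"


def decide_action(
    is_up: bool,
    down_since: Optional[datetime],
    now: datetime,
    maintenance_mode: bool,
    reboot_after: timedelta = timedelta(minutes=15),  # assumed threshold
    last_reboot_at: Optional[datetime] = None,
    cooldown: timedelta = timedelta(hours=1),  # assumed safeguard, not from the source
) -> Action:
    """Decide what the monitoring loop should do after one check."""
    if is_up or maintenance_mode:
        return Action.NONE  # healthy, or downtime is intentional
    if down_since is None or now - down_since < reboot_after:
        return Action.ALERT_ONLY  # outage is too recent to justify a reboot
    if last_reboot_at is not None and now - last_reboot_at < cooldown:
        return Action.ALERT_ONLY  # avoid reboot loops on a persistent fault
    return Action.REBOOT
```

Keeping the decision separate from the side effect makes the policy easy to unit-test. When the result is Action.REBOOT, the caller would look up the instance mapped to the site and issue the reboot, for example through the boto3 EC2 client's reboot_instances call.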

Impact and Adoption

AximWatch was built to protect recurring revenue by reducing downtime risk across an agency-managed website portfolio. Instead of treating monitoring as “is the server responding,” it validates application-level health (including keyword/content checks), preserves incident timelines, and supports structured escalation—so teams can detect issues earlier, trust the signal, and respond with context. The result is a more reliable service experience that supports retention, renewals, and client confidence without increasing operational burden.

On the operational side, AximWatch consolidates tooling and reclaims labor by connecting detection to diagnostics and controlled remediation. It reduces reliance on third-party monitoring subscriptions and shrinks after-hours triage time by automating repeatable actions (with safeguards such as maintenance mode and delayed remediation thresholds). Exact dollar totals and client-specific financials are intentionally not published to respect confidentiality and professional obligations; the platform’s impact is presented as ranges and operational KPIs rather than proprietary earnings.

KPI Readout
ARR Protected (annualized): Mid five figures (exact figures not disclosed)
Tool Cost Savings (annualized): Low four figures (exact figures not disclosed)
Reclaimed Labor Capacity: Hundreds of hours/year (range-based estimate)
Reliability Uplift: ~99.9% → ~99.99% uptime (portfolio-level improvement)
Incidents Mitigated: 100+ (production period total)

Note: Specific client revenue, internal pricing, and exact monetary totals are withheld due to confidentiality and professional obligations; metrics are shown in ranges or operational terms.
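
To put the reliability figures in the readout above into concrete terms, the short calculation below converts an availability percentage into an annual downtime budget. The function name is hypothetical, and a 365-day year is assumed; the source figures are approximate, so the results are orders of magnitude rather than measurements.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 365) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * 24 * 60


before = allowed_downtime_minutes(99.9)   # roughly 526 minutes, about 8.8 hours/year
after = allowed_downtime_minutes(99.99)   # roughly 53 minutes/year
```

In other words, moving from three nines to four nines cuts the yearly downtime budget by a factor of ten, from most of a working day to under an hour.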

AximWatch gained adoption because it fits the reality of agency operations: multiple technicians, multiple client stakeholders, and dozens of websites that need consistent visibility without giving every recipient a full platform account. The system is multi-tenant by design, with team-based ownership and role-aware access patterns, so internal staff can operate quickly while client contacts can receive alerts through non-user notification routing—reducing friction and avoiding account sprawl.
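
The routing idea above, where account holders and non-user contacts share one alert path with per-recipient preferences, might look like the following sketch. Recipient and recipients_for are hypothetical names, and the two preference flags are assumed for illustration; the actual preference model is not described in this write-up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Recipient:
    """A person who can receive alerts, with or without a platform account."""
    address: str
    is_platform_user: bool
    wants_down_alerts: bool = True       # assumed preference flag
    wants_recovery_alerts: bool = True   # assumed preference flag


def recipients_for(event: str, recipients: List[Recipient]) -> List[str]:
    """Return the de-duplicated addresses that should hear about `event`."""
    if event not in ("down", "recovered"):
        raise ValueError(f"unknown event: {event!r}")
    selected = [
        r.address
        for r in recipients
        if (r.wants_down_alerts if event == "down" else r.wants_recovery_alerts)
    ]
    # dict.fromkeys keeps the first occurrence and preserves order, so a person
    # listed both as a team member and as a client contact is notified once.
    return list(dict.fromkeys(selected))
```

Note that is_platform_user plays no part in the selection: a client contact without an account is routed exactly like a team member, which is what keeps stakeholders informed without creating extra accounts.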

Adoption also held because the tool improved daily workflow—not just incident response. Dashboards and per-website reporting made it easy to review health at a glance, while “Check Now” verification and Maintenance Mode kept the monitoring signal clean during planned changes. By pairing reliable detection with diagnostic context and a clear escalation path, AximWatch became something operators trusted: it didn’t merely notify—it documented what happened, preserved the timeline, and supported structured action. Exact customer counts and revenue-linked adoption metrics are not disclosed due to confidentiality and professional obligations; adoption is described in operational terms.

Adoption KPI Readout
Deployment Model: Multi-tenant team-based rollout
Primary Users: Agency operators + on-call responders
Stakeholder Reach: Internal users + external non-user contacts
Onboarding Method: Team membership + invite/token registration
Workflow Fit Controls: Maintenance Mode + “Check Now” verification
Reporting Cadence: Dashboard view + per-site incident history
Confidentiality Note: Exact adoption totals withheld (client/private data)

Acknowledgements

Gratitude and Recognition

I’m grateful to Shardul and Benjamin at Axim Solutions for supporting early evaluation and enabling operational pilot use that accelerated iteration and hardening. I also appreciate Mark Elliot at NCPI for being an early hands-on tester and providing practical operational feedback during the initial rollout phase. Specific commercial terms, internal infrastructure details, and client-sensitive information are intentionally omitted due to confidentiality and professional obligations.

Thanks to Cassandra Williams for accessibility-oriented UX review and workflow refinement informed by her experience supporting users who rely on accessible web interfaces, and to my daughter Kimberly for family support during intensive build cycles.