Global Cloud Resource Monitoring System Without Managed Services

Public cloud platforms such as AWS, Azure, and Google Cloud offer rich resource monitoring services, but these services tend to be expensive, and the monitoring services of different cloud platforms are generally incompatible with one another. Users running multi-cloud deployments also want to avoid lock-in to a single provider. Building a monitoring system from mainstream open-source software on top of a cloud platform's IaaS services is therefore an operations best practice.

This article uses AWS as an example to show how to build a global resource monitoring system without managed services; other cloud platforms can follow a similar approach.

Core Objective: Monitor the health of AWS infrastructure resources and the operational status of self-hosted digital systems, with no managed-service dependencies, global coverage, and 99.99% availability.

1. Core Design Boundaries and Principles

1. Monitoring Targets (Clearly Focused)

2. Design Principles

2. Overall Layered Architecture Design

Architecture Overview

Monitoring Targets (AWS Resources + Self-hosted Systems)
        ↓
Metric Collection Layer (Self-deployed Collectors)
        ↓
Transmission Forwarding Layer (Self-hosted Kafka) → Cross-region Data Sync (Kafka MirrorMaker)
        ↓
Storage Analysis Layer (Prometheus + Thanos) → Disaster Recovery Backup Layer (S3 + Cross-region Replication)
        ↓
Visualization Alerting Layer (Grafana + Alertmanager)

3. Detailed Design of Each Layer (No Managed Service Dependencies)

(1) Metric Collection Layer: Comprehensive Data Collection, Full Coverage

Core Objective: Lightweight, Multi-source Collection, Adapted for AWS Resources + Self-hosted Systems

  1. Technology Selection (Purely Self-deployed):

    • Resource metric collection: Node Exporter (system-level), AWS CLI custom scripts (AWS resources), cAdvisor (container-level);
    • Service metric collection: Blackbox Exporter (port/HTTP liveness), JMX Exporter (middleware), custom Exporters (databases/applications);
    • Collector deployment: all collectors run as Docker containers on EC2 or K8s nodes, with no dependence on managed collection services (a container deployment sketch follows this list).

  2. Collection Scenarios:

| Monitoring Target | Collection Tool/Method | Core Metrics Collected |
| --- | --- | --- |
| AWS EC2/EBS | Node Exporter + AWS CLI scripts (scheduled AWS API calls) | CPU usage, memory usage, disk IO, network throughput, EBS read/write latency, instance status (running/stopped/restarted) |
| AWS VPC/Security Groups | Custom shell scripts (scheduled checks of route table reachability and security group rule validity) + Route53 health checks | Available IPs per subnet, route table association status, security group port openness, VPC traffic drop rate |
| AWS ELB/S3 | AWS CLI scripts (scheduled queries of ELB health status, S3 storage usage and request success rate) | ELB healthy backend instances, request volume, response latency, S3 storage usage, PUT/GET success rate |
| Self-hosted K8s Cluster | cAdvisor + kube-state-metrics (self-deployed) | Pod/Node resource usage, Pod restart count, Deployment replica readiness, Node readiness status |
| Middleware (Kafka/MySQL) | JMX Exporter (Kafka) + custom SQL scripts (MySQL) | Kafka partition health, Topic backlog, MySQL connections, query latency, master-slave replication lag |
| Application Services | Custom Exporters (instrumented metric endpoints / log parsing) | API response time, error rate, QPS, thread pool status |
  3. Collection Optimization:

    • Collection frequency: core metrics (CPU/memory/liveness status) every 10 seconds; non-core metrics (storage usage/S3 status) every 60 seconds (the scheduled AWS CLI collection is also sketched below);
    • Lightweight: collector containers are limited to ≤0.2 vCPU and ≤256 MB memory to avoid crowding out business workloads;
    • Fault tolerance: when collection fails, metrics are cached locally for 5 minutes and batch-uploaded on recovery to avoid data loss.
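To make the collector deployment and resource budget concrete, here is a minimal sketch of running Node Exporter as a resource-limited Docker container on an EC2 node; the image tag, host mounts, and flags are illustrative assumptions rather than part of the original design, and the same pattern applies to cAdvisor, Blackbox Exporter, and the JMX Exporter.

```bash
# Hypothetical sketch: run Node Exporter as a lightweight container,
# capped at 0.2 vCPU / 256 MB to match the collection-layer budget.
docker run -d --name node-exporter \
  --restart unless-stopped \
  --cpus 0.2 --memory 256m \
  --net host --pid host \
  -v /:/host:ro,rslave \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
```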
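And here is a minimal sketch of one of the scheduled AWS CLI collection scripts, polling EC2 instance state and exposing it through Node Exporter's textfile collector. The metric name, output directory, and the textfile-collector hand-off are assumptions for illustration; the design only states that AWS CLI scripts are called on a schedule.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: poll EC2 instance state with the AWS CLI (run from cron
# at the 60-second cadence used for non-core metrics) and publish it as a
# Prometheus gauge via Node Exporter's textfile collector. Assumes node_exporter
# runs with --collector.textfile.directory=/var/lib/node_exporter/textfile.
set -euo pipefail

REGION="${1:-us-east-1}"
OUT_DIR="/var/lib/node_exporter/textfile"
TMP_FILE="$(mktemp "${OUT_DIR}/aws_ec2.prom.XXXXXX")"

# One line per instance: "<instance-id> <state-name>"
aws ec2 describe-instances --region "$REGION" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --output text |
while read -r instance_id state; do
  # Expose the instance state as a 0/1 gauge (1 = running).
  value=0
  [ "$state" = "running" ] && value=1
  echo "aws_ec2_instance_running{instance_id=\"${instance_id}\",region=\"${REGION}\"} ${value}" >> "$TMP_FILE"
done

# Atomic rename so a scrape never sees a half-written file.
chmod 644 "$TMP_FILE"
mv "$TMP_FILE" "${OUT_DIR}/aws_ec2.prom"
```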

(2) Transmission Forwarding Layer: Reliable, Low-latency Metric Flow

Core Objective: Peak shaving and valley filling, cross-region transmission, and prevention of metric loss

  1. Technology Selection: Self-hosted Kafka cluster (alternative to managed messaging services) + Kafka MirrorMaker (cross-region sync);

  2. Deployment Architecture:

    • Within a region: deploy 1 EC2 instance (i3.large, local SSD) in each of 3 AZs as Kafka Brokers, plus a 3-node ZooKeeper ensemble on dedicated EC2 instances (one per AZ); Topics are configured with 3 replicas (cross-AZ);
    • Cross-region sync: deploy Kafka MirrorMaker 2.0 to sync each region's metric Topics to the other regions, keeping cross-region data consistent (a sample MirrorMaker 2.0 configuration is sketched after this list);
    • Topic design: split by metric type, e.g. monitor/aws/ec2, monitor/system/k8s, monitor/middleware/kafka, with partitions per Topic equal to the number of brokers in the region;

  3. Key Guarantees:

    • Reliability: Kafka Topics retain data for 24 hours, supporting metric replay; producers use acks=all so a write succeeds only after it reaches all replicas (see the Topic creation sketch after this list);
    • Peak shaving: Kafka's buffer queues absorb metric collection peaks (e.g. bulk reporting during system restarts) and keep the backend storage from being overloaded;
    • Encryption in transit: TLS 1.3 for Kafka inter-node communication, and SASL authentication + TLS encryption for collector→Kafka traffic.
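As a sketch of the Topic layout and retention settings above, assuming a 3-broker regional cluster behind a placeholder bootstrap address; note that Kafka topic names cannot contain '/', so the slashes in the naming scheme are rendered as dots here.

```bash
# Sketch: create the per-region metric Topics with 3 cross-AZ replicas,
# 24 h retention and one partition per broker. The bootstrap address is a
# placeholder; '/' from the naming scheme becomes '.' because Kafka topic
# names only allow alphanumerics, '.', '_' and '-'.
BOOTSTRAP="kafka-az1.internal:9092"
PARTITIONS=3   # = number of brokers in the region

for topic in monitor.aws.ec2 monitor.system.k8s monitor.middleware.kafka; do
  kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --create \
    --topic "$topic" \
    --partitions "$PARTITIONS" \
    --replication-factor 3 \
    --config retention.ms=86400000   # 24 hours
done
```

On the producer side, the collectors would set acks=all so a write only succeeds once it has reached all in-sync replicas, matching the reliability guarantee above.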
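The cross-region sync can be expressed as a MirrorMaker 2.0 properties file along the following lines; the cluster aliases and broker addresses are placeholders.

```bash
# Sketch of a MirrorMaker 2.0 configuration replicating the monitoring Topics
# from us-east to eu-west; cluster aliases and addresses are placeholders.
cat > mm2.properties <<'EOF'
clusters = us-east, eu-west
us-east.bootstrap.servers = kafka-use1-a:9092,kafka-use1-b:9092,kafka-use1-c:9092
eu-west.bootstrap.servers = kafka-euw1-a:9092,kafka-euw1-b:9092,kafka-euw1-c:9092

# Mirror all monitoring Topics in this direction; add the reverse flow
# (eu-west->us-east) symmetrically for bidirectional sync.
us-east->eu-west.enabled = true
us-east->eu-west.topics = monitor.*
replication.factor = 3
EOF

# MirrorMaker 2.0 ships with Kafka and runs on top of Kafka Connect.
connect-mirror-maker.sh mm2.properties
```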

(3) Storage Analysis Layer: Time-series Data Storage + Cross-region Aggregation

Core Objective: High-throughput writes, low-latency queries, long-term storage, global data aggregation

  1. Technology Selection: Self-hosted Prometheus cluster (real-time metrics) + Thanos (long-term storage + cross-region aggregation) + S3 (cold data archiving);

  2. Deployment Architecture:

    • Regional Prometheus: deploy 1 highly available Prometheus cluster (2 Server nodes + shared storage) in each of 3 core regions (US East, EU West, Asia Pacific); each cluster consumes metrics from Kafka via a Kafka adapter and writes them into Prometheus;
      - Storage configuration: local SSD for hot data (last 7 days), with write throughput supporting 100k metrics/sec;
      - High availability: the 2 Prometheus Servers run identical collection rules, and their data is exposed and synchronized via Thanos Sidecar to avoid a single point of failure;
    • Thanos Cluster (cross-region aggregation; a launch sketch follows this list):
      - Thanos Query: 1 EC2 (m5.large) per region receives Grafana queries and aggregates data from the local and remote regions' Prometheus;
      - Thanos Store Gateway: deployed in an S3-access-optimized region, serving S3 cold data (metrics older than 7 days) for historical queries;
      - Thanos Compactor: periodically compacts historical metrics in S3 to reduce storage costs;
    • Long-term storage: Prometheus hot data is uploaded to S3 by the Thanos Sidecar after 7 days; S3 cross-region replication syncs it to 2 backup regions, and data is retained for 90 days (configurable).
  3. Query Optimization:

    • Metric sharding: shard storage by region + metric type and route queries only to the target shards;
    • Cache strategy: Thanos Query enables local caching (hot query results cached for 5 minutes) to reduce cross-region query latency;
    • Index optimization: Prometheus's inverted index supports fast filtering of metrics by resource ID (e.g. EC2 instance ID, K8s Pod name).
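To make the Thanos topology concrete, here is a minimal launch sketch of the components named above using standard Thanos CLI flags; the S3 bucket, file paths, ports, and endpoint addresses are placeholder assumptions, and exact flag names can vary slightly between Thanos versions. In practice each component runs on its own node or container per the deployment described above.

```bash
# Object-store config shared by the Thanos components (bucket/region are placeholders).
cat > /etc/thanos/s3.yaml <<'EOF'
type: S3
config:
  bucket: monitoring-metrics-archive
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
EOF

# Sidecar next to each Prometheus server: exposes its data to Thanos Query and
# uploads completed TSDB blocks (the 7-day hot data) to S3.
thanos sidecar \
  --tsdb.path /var/lib/prometheus \
  --prometheus.url http://localhost:9090 \
  --objstore.config-file /etc/thanos/s3.yaml &

# Regional Query node: fans queries out to the local sidecars, the remote
# regions' endpoints, and the store gateway.
thanos query \
  --http-address 0.0.0.0:10902 \
  --endpoint prometheus-a.internal:10901 \
  --endpoint prometheus-b.internal:10901 \
  --endpoint store-gateway.internal:10901 &

# Store Gateway: serves cold blocks (older than 7 days) directly from S3.
thanos store --objstore.config-file /etc/thanos/s3.yaml &

# Compactor: compacts/downsamples historical blocks and enforces the 90-day retention.
thanos compact \
  --objstore.config-file /etc/thanos/s3.yaml \
  --retention.resolution-raw 90d \
  --wait
```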

(4) Visualization Alerting Layer: Global Unified View + Real-time Alerts

Core Objective: Intuitive display, fast alerts, fault localization

  1. Technology Selection: Self-hosted Grafana (visualization) + Alertmanager (alerts) + Custom alert routing service;

  2. Deployment Architecture:

    • Grafana Cluster: deploy 1 EC2 (t3.medium) in each of 3 regions, with NLB + Route53 intelligent DNS giving users nearest-region access globally;
      - Data sources: connect to every region's Thanos Query, supporting three dashboard levels: "Global Overview - Regional Details - Resource Details";
      - Access control: LDAP authentication (self-hosted OpenLDAP cluster), with resource access permissions assigned by role (admin/ops/dev);
    • Alertmanager Cluster: deploy 1 EC2 (t3.small) per region, associated with the regional Prometheus cluster, supporting alert grouping, silencing, and inhibition;
    • Alert Routing: a self-developed alert forwarding service (deployed in K8s) pushes alerts to email, collaboration systems, and ticketing systems, routing them to different responsible parties by fault level (P0-P3).
  3. Core Feature Adaptation: - Visualization Dashboards:

    • Global Overview: AWS resource availability, system service health, TOP5 alert types by region;
    • Resource Details: Real-time metric curves for EC2/ELB/S3 (CPU/memory/traffic), abnormal metric marking;
    • Fault Troubleshooting: Correlated metrics + logs (integrated with self-hosted ELK cluster), support retrospective analysis of data before/after faults by time range;
    • Alert Rules (a severity-tagged rule file and matching Alertmanager routing are sketched after this list):
    • P0 (Critical): Region-level service crashes (e.g., Kafka cluster unavailable), core EC2 instance offline >30 seconds;
    • P1 (Severe): Resource utilization exceeding thresholds (CPU>90% for 5 minutes), database connections exceeding limits;
    • P2 (Moderate): Non-critical service response delay >1 second, S3 storage usage >80%;
    • P3 (Info): Resource configuration nearing expiration, non-critical metric fluctuations.
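As a sketch of how the P0-P3 levels can be encoded as Prometheus alert rules evaluated by the regional Prometheus clusters; the S3 usage metric and the tier="core" label are assumptions, since they would come from the custom AWS collection scripts rather than a standard exporter.

```bash
# Sketch of severity-tagged alert rules for a few of the examples above.
# node_* metrics come from Node Exporter; aws_s3_bucket_usage_percent and the
# tier="core" label are assumed names produced by the custom collectors.
cat > /etc/prometheus/rules/severity.yml <<'EOF'
groups:
  - name: severity-levels
    rules:
      - alert: CoreInstanceDown                      # P0: core EC2 offline > 30 s
        expr: up{tier="core"} == 0
        for: 30s
        labels:
          severity: P0
        annotations:
          summary: "Core instance {{ $labels.instance }} down for more than 30s"

      - alert: HighCpuUsage                          # P1: CPU > 90% for 5 minutes
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "CPU above 90% for 5 minutes on {{ $labels.instance }}"

      - alert: S3UsageHigh                           # P2: S3 storage usage > 80%
        expr: aws_s3_bucket_usage_percent > 80
        labels:
          severity: P2
        annotations:
          summary: "S3 bucket {{ $labels.bucket }} above 80% usage"
EOF
```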
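Alertmanager can then route on that severity label toward the self-developed forwarding service; the receiver names and webhook URLs below are placeholders for that service, which in turn fans alerts out to email, collaboration, and ticketing systems.

```bash
# Sketch of Alertmanager routing by fault level. All receivers point at the
# self-developed alert forwarding service (placeholder URLs).
cat > /etc/alertmanager/alertmanager.yml <<'EOF'
route:
  receiver: alert-router-default
  group_by: ['alertname', 'region']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="P0"']
      receiver: alert-router-oncall      # page the on-call rotation immediately
      repeat_interval: 15m
    - matchers: ['severity=~"P2|P3"']
      receiver: alert-router-low         # low priority: notification/ticket only

receivers:
  - name: alert-router-default
    webhook_configs:
      - url: http://alert-router.monitoring.svc:8080/alerts/default
  - name: alert-router-oncall
    webhook_configs:
      - url: http://alert-router.monitoring.svc:8080/alerts/oncall
  - name: alert-router-low
    webhook_configs:
      - url: http://alert-router.monitoring.svc:8080/alerts/low
EOF
```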

(5) Disaster Recovery and High Availability Layer: Seamless Failure Switching

Core Objective: Monitoring uninterrupted and data not lost during single-region failures

  1. Region-level Failover:
  2. Service-level Self-healing:

4. Core Monitoring Scenario Implementation

1. AWS Infrastructure Resource Health Monitoring

2. Self-hosted System Operation Monitoring

3. Fault Root Cause Localization

5. Security and Compliance Design

  1. Data Security:
  2. Compliance Adaptation:

6. Operations Automation Design (No Managed Ops Services)

  1. Infrastructure Orchestration:
  2. Service Automated Operations:

7. Core Metrics Assurance

| Metric Type | Target Value | Implementation Approach |
| --- | --- | --- |
| System Availability | 99.99% (annual downtime < 53 minutes) | Multi-region + multi-AZ deployment, automatic failover, multi-replica services |
| Metric Collection Latency | < 30 seconds | Lightweight collectors, low-latency Kafka transmission |
| Alert Triggering Latency | < 30 seconds | Real-time Prometheus rule evaluation, second-level Alertmanager forwarding |
| Cross-region Query Latency | < 1 second | Thanos aggregation, cache optimization |
| Data Retention Period | Hot data 7 days, cold data 90 days | Prometheus + S3 tiered storage |

Summary

This architecture is built entirely on self-hosted components running on basic AWS resources, with no managed-service dependencies. Through a closed loop of "comprehensive collection → reliable transmission → tiered storage → global aggregation → real-time alerting", it provides unified monitoring of AWS infrastructure resources and self-hosted digital systems. Multi-region, multi-AZ deployment underpins 99.99% availability; cross-region data synchronization and nearest-region access serve a global business; and lightweight collection plus optimized storage balance performance against cost, meeting global monitoring needs for both systems and cloud resources while satisfying the security and compliance requirements of enterprises operating internationally.