Microservices Monitoring & Performance Report
System Status
Three of the four golden signals remain within SLO targets this week; the exception is order-svc, whose P99 latency briefly breached its 40ms target on Mar-11 (see Top 5 Active Alerts). Overall P99 latency continued its downward trend following the order-svc query optimization shipped in v2.1.0. Throughput growth reflects a seasonal uptick in order volume heading into the weekend peak window.
Performance Trends
Response Latency Trends (7-day)
P50 / P99 / P999 Latency (ms) — 2025-03-08 to 2025-03-14
Per-Service Error Rates (7-day average)
Error Rate (%) — 7-day average per microservice
Service Inventory
| Service | Status | P99 Latency | Uptime (7d) | Instances | Team |
|---|---|---|---|---|---|
| api-gateway | HEALTHY | 9ms | 100.00% | 6 | Platform |
| auth-svc | HEALTHY | 5ms | 100.00% | 4 | Identity |
| order-svc | DEGRADED | 42ms | 99.91% | 8 | Commerce |
| user-svc | HEALTHY | 8ms | 100.00% | 4 | Identity |
| payment-svc | HEALTHY | 11ms | 100.00% | 4 | Finance |
| notify-svc | HEALTHY | 3ms | 100.00% | 2 | Platform |
Top 5 Active Alerts
- order-svc: CPU utilization exceeded the 60% threshold for 45 continuous minutes (Mar-11 14:22 UTC)
- order-svc: P99 latency breached the 40ms SLO target (Mar-11 14:35 UTC)
- api-gateway: single-node 502 rate briefly spiked to 0.1% (Mar-09 03:17 UTC, duration: 2 min)
- PostgreSQL: primary-instance slow-query count exceeded 50/min (Mar-12 18:04 UTC)
- Redis: connection pool saturation reached 82% (Mar-13 20:31 UTC, duration: 8 min)
Optimization Backlog
- order-svc query paths lack a Redis caching layer, causing elevated read pressure on PostgreSQL
- api-gateway has not yet enabled HTTP/2; upgrading is projected to reduce connection overhead by ~15%
- notify-svc runs only 2 instances; expanding to 3 is required to satisfy HA policy
- Missing index on `orders.created_at` triggers sequential scans on all date-range queries
- Distributed trace sampling rate is 1%; increasing to 5% would improve incident diagnostic fidelity
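To make the first backlog item concrete, the order-svc read path could adopt a cache-aside pattern: check Redis first, fall back to PostgreSQL on a miss, and populate the cache with a short TTL. A minimal sketch follows; the in-memory `Cache` stands in for Redis, and `getOrder`/`loadFromDB` are illustrative names, not the real order-svc code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	val     string
	expires time.Time
}

// Cache is an in-memory stand-in for Redis, used here only to
// illustrate the cache-aside read path proposed for order-svc.
type Cache struct {
	mu   sync.Mutex
	data map[string]entry
}

func NewCache() *Cache { return &Cache{data: make(map[string]entry)} }

func (c *Cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.data[key]
	if !ok || time.Now().After(e.expires) {
		return "", false
	}
	return e.val, true
}

func (c *Cache) Set(key, val string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = entry{val: val, expires: time.Now().Add(ttl)}
}

// getOrder checks the cache first and falls back to the database
// loader (here a stub) on a miss, caching the result with a short TTL.
func getOrder(c *Cache, id string, loadFromDB func(string) string) string {
	if v, ok := c.Get("order:" + id); ok {
		return v // cache hit: no PostgreSQL read
	}
	v := loadFromDB(id)
	c.Set("order:"+id, v, 30*time.Second)
	return v
}

func main() {
	cache := NewCache()
	dbReads := 0
	load := func(id string) string { dbReads++; return "order-" + id }

	getOrder(cache, "42", load)
	getOrder(cache, "42", load) // second read served from cache
	fmt.Println("db reads:", dbReads)
}
```

The short TTL keeps stale reads bounded; invalidation on order writes would still be needed before shipping this.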
Deployment Timeline
This week, v2.1.0 completed full production rollout, delivering order-svc query optimizations and auth-svc JWT refresh logic refactoring. The following records each deployment gate and decision point.
Call Chain Diagram
The diagram below shows the core request path service dependency topology. Arrows represent synchronous gRPC/HTTP call directions. Dashed lines indicate conditional cache reads.
Monitoring Queries
The following Go handler is used across all microservices as the standard health check endpoint, wired to both Kubernetes liveness and readiness probes. It performs lightweight dependency checks and returns structured JSON with per-component status, version, and uptime.
```go
package health

import (
	"encoding/json"
	"net/http"
	"time"
)

// Status constants used by all downstream consumers (Prometheus, k8s, PagerDuty).
const (
	StatusOK       = "ok"
	StatusDegraded = "degraded"
	StatusDown     = "down"
)

// HealthResponse is the canonical health check payload.
type HealthResponse struct {
	Status     string            `json:"status"`
	Version    string            `json:"version"`
	Uptime     string            `json:"uptime"`
	Checks     map[string]string `json:"checks"`
	ReportedAt time.Time         `json:"reported_at"`
}

var startTime = time.Now()

// Handler returns an http.HandlerFunc suitable for /healthz and /readyz routes.
// deps is a map of dependency name → check function (returns error on failure).
func Handler(version string, deps map[string]func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		overall := StatusOK
		checks := make(map[string]string, len(deps))
		for name, checkFn := range deps {
			if err := checkFn(); err != nil {
				checks[name] = StatusDown + ": " + err.Error()
				overall = StatusDegraded
			} else {
				checks[name] = StatusOK
			}
		}
		resp := HealthResponse{
			Status:     overall,
			Version:    version,
			Uptime:     time.Since(startTime).Round(time.Second).String(),
			Checks:     checks,
			ReportedAt: time.Now().UTC(),
		}
		statusCode := http.StatusOK
		if overall != StatusOK {
			statusCode = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.Header().Set("Cache-Control", "no-cache, no-store")
		w.WriteHeader(statusCode)
		json.NewEncoder(w).Encode(resp) //nolint:errcheck
	}
}

// Example: wiring the handler in main.go
//
//	http.Handle("/healthz", health.Handler("v2.1.1", map[string]func() error{
//		"postgres": db.Ping,
//		"redis":    redisClient.Ping,
//	}))
```
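The latency panels themselves are driven by PromQL. As a reference point, a per-service P99 query over a request-duration histogram might look like the following; the `http_request_duration_seconds` metric name and the 5m window are assumptions, so match them to the actual instrumentation:

```promql
# P99 request latency per service over the last 5 minutes
histogram_quantile(
  0.99,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
```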
Operations Notices
Raising the connection pool's max_idle setting from 10 to 20 should reduce connection wait time during peak load. Also consider enabling a daily auto-reset of pg_stat_statements at 00:00 UTC to keep slow-query data fresh. Questions about these changes go to #platform-oncall.
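A sketch of how the proposed pool settings could be applied with `database/sql`; `openTunedPool`, the DSN, and the idle-time value are illustrative, and the real services wire these through their own config:

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// openTunedPool opens a *sql.DB and applies the proposed pool settings.
// A "postgres" driver must be registered by the importing service
// (e.g. via a lib/pq or pgx blank import); none is registered here.
func openTunedPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(200)                // stays under the PgBouncer ceiling
	db.SetMaxIdleConns(20)                 // raised from 10 per this notice
	db.SetConnMaxIdleTime(5 * time.Minute) // recycle long-idle conns (illustrative)
	return db, nil
}

func main() {
	if _, err := openTunedPool("host=localhost dbname=orders"); err != nil {
		fmt.Println("open failed:", err) // expected here: no driver registered
	}
}
```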
Architecture Images
The production cluster runs on AWS us-east-1 across three availability zones (AZ-a / AZ-b / AZ-c), with six worker nodes per AZ (c6i.2xlarge, 8 vCPU / 32 GB RAM). All microservices are deployed as Kubernetes Deployments with topologySpreadConstraints ensuring cross-AZ pod distribution so a single-AZ outage does not impact service continuity.
Istio Service Mesh handles inter-service mTLS encryption, traffic management, and observability data collection. Envoy sidecar injection is at 100%; enabling Telemetry v2 reduced Prometheus metric collection latency by approximately 18%. All egress traffic to external dependencies routes through an Istio egress gateway for auditability.
The storage tier uses RDS for PostgreSQL 14 (Multi-AZ primary + two read replicas) and ElastiCache Redis 7.0 cluster (three primaries, three replicas). Database connections are proxied through PgBouncer in transaction-pool mode with a maximum of 200 connections.
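For reference, the cross-AZ spread constraint might appear in a Deployment spec roughly as follows; this is a sketch, and the `app: order-svc` label is illustrative rather than the actual selector:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: order-svc
```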
System Overview
The system employs a Domain-Driven Design (DDD) microservice decomposition strategy, partitioning the business domain into six independent services: gateway, identity, commerce, user profile, payments, and notifications. Each service is independently deployed and scaled, communicating via synchronous gRPC (Protobuf) for request-response flows and Apache Kafka for asynchronous event propagation — decoupling critical business processes such as order creation → payment notification → inventory adjustment.
"When you design a distributed system, the problem you ultimately face is not a technology problem — it is a boundary problem. Service boundaries, data boundaries, failure boundaries. Draw the right boundaries and complexity is encapsulated in exactly the right places."
— Platform Engineering Team, Architecture Principles v3.0
The observability stack conforms to the OpenTelemetry specification, unifying the three telemetry signals: Prometheus handles metrics storage and alerting; Jaeger carries distributed traces; Loki aggregates structured logs. Alerts are routed through AlertManager to PagerDuty (P1/P2) and Slack (P3/P4). MTTD target: < 3 minutes. MTTR target: < 15 minutes.
Core Technology Stack
- Runtime: Go 1.22 (high-concurrency services) / Python 3.12 (data processing pipelines)
- Orchestration: Kubernetes 1.29 + Helm 3 + ArgoCD (GitOps)
- Service Mesh: Istio 1.21 (mTLS / traffic management / circuit breaking)
- Message Queue: Apache Kafka 3.7 (3 brokers, RF=3, min.insync.replicas=2)
- Observability: Prometheus + Grafana + Jaeger + Loki + OpenTelemetry Collector
- CI/CD: GitHub Actions → Docker Build → GHCR → ArgoCD Image Updater
This report is generated automatically by the report-exporter scheduled job, which pulls data from the Prometheus HTTP API. It is published every Monday at 08:00 UTC. If you spot data anomalies, contact the SRE on-call team or post in the #platform-metrics Slack channel.