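Designing a Monitoring and Alerting System

Question

How would you design an application monitoring and alerting system in Python? How do you instrument code with Prometheus metrics?

Answer

Architecture

Prometheus Metric Instrumentation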

monitoring/metrics.py
from prometheus_client import Counter, Histogram, Gauge, Info

# Request count, labelled by method, endpoint and status code
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

# Request latency (histogram, bucket bounds in seconds)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

# Active connections (current value)
ACTIVE_CONNECTIONS = Gauge(
    "active_connections",
    "Number of active connections",
)

# Static application information
APP_INFO = Info("app", "Application information")
APP_INFO.info({"version": "1.2.0", "env": "production"})
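
To see what definitions like these produce, here is a minimal sketch that registers comparable metrics on a private `CollectorRegistry` (an assumption made so the demo does not collide with the module-level metrics above), records a single request, and prints the text format that Prometheus scrapes. The `demo_*` names and the label values are invented for illustration.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram, generate_latest

# A private registry keeps this demo separate from the default global registry.
registry = CollectorRegistry()

demo_requests = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
    registry=registry,
)
demo_latency = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
    registry=registry,
)
demo_connections = Gauge(
    "active_connections",
    "Number of active connections",
    registry=registry,
)

# Record one hypothetical request: GET /orders answered with 200 in 120 ms.
demo_requests.labels(method="GET", endpoint="/orders", status="200").inc()
demo_latency.labels(method="GET", endpoint="/orders").observe(0.12)
demo_connections.set(3)

# generate_latest() returns the scrape payload as bytes.
exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```

The 0.12 s observation is counted in the `le="0.25"` bucket and every larger one, but not in `le="0.1"`, because histogram buckets are cumulative.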

FastAPI Middleware

monitoring/middleware.py
import time
from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware
from prometheus_client import make_asgi_app

# Metrics defined in monitoring/metrics.py
from monitoring.metrics import ACTIVE_CONNECTIONS, REQUEST_COUNT, REQUEST_LATENCY

class MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        method = request.method
        # Caution: raw paths that embed IDs create one time series per distinct path
        endpoint = request.url.path

        ACTIVE_CONNECTIONS.inc()
        start = time.perf_counter()

        try:
            response = await call_next(request)
            status = response.status_code
        except Exception:
            # Unhandled errors are recorded as 500 and then re-raised
            status = 500
            raise
        finally:
            duration = time.perf_counter() - start
            REQUEST_COUNT.labels(method, endpoint, status).inc()
            REQUEST_LATENCY.labels(method, endpoint).observe(duration)
            ACTIVE_CONNECTIONS.dec()

        return response

# Register the middleware and mount the /metrics endpoint
app = FastAPI()
app.add_middleware(MetricsMiddleware)
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
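
One caveat when labelling by `request.url.path`: paths that embed IDs (`/orders/123`, `/orders/124`, and so on) create a new time series for every distinct value. A minimal sketch of one way to cap label cardinality is to collapse ID-like segments before the path is used as a label. The helper name `normalize_path` and the `:id` placeholder are hypothetical and not part of FastAPI or prometheus_client.

```python
import re

# Segments that are purely numeric or look like a UUID are treated as IDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

def normalize_path(path: str) -> str:
    """Replace ID-like path segments with a fixed placeholder."""
    segments = path.split("/")
    return "/".join(":id" if _ID_SEGMENT.match(seg) else seg for seg in segments)

print(normalize_path("/orders/123/items/456"))
print(normalize_path("/users/3f2b8c1e-9d4a-4c6b-8e2f-1a2b3c4d5e6f"))
print(normalize_path("/health"))
```

In the middleware, this would presumably be applied as `endpoint = normalize_path(request.url.path)`; whether a regex heuristic or the framework's route template is the better source depends on the application.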

Custom Business Metrics

monitoring/business.py
from prometheus_client import Counter, Histogram

# Business metrics
ORDER_CREATED = Counter("orders_created_total", "Total orders created", ["product_type"])
PAYMENT_AMOUNT = Histogram(
    "payment_amount_yuan",
    "Payment amount distribution",
    buckets=[10, 50, 100, 500, 1000, 5000],
)

def create_order(order):
    # Business logic...
    ORDER_CREATED.labels(product_type=order.type).inc()
    PAYMENT_AMOUNT.observe(order.amount)
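
Histogram buckets are cumulative: each observation is counted in every bucket whose upper bound (`le`) is at or above the value, including the implicit `+Inf` bucket. The sketch below reproduces that counting with the standard library for the payment buckets above. The helper `cumulative_counts` and the sample amounts are invented for illustration.

```python
import math

# The client library appends the +Inf bucket automatically; it is spelled out here.
BUCKETS = [10, 50, 100, 500, 1000, 5000, math.inf]

def cumulative_counts(values, buckets=BUCKETS):
    """Return {upper_bound: number of values <= upper_bound}."""
    return {le: sum(1 for v in values if v <= le) for le in buckets}

amounts = [8, 35, 99, 100, 250, 7000]  # hypothetical order amounts in yuan
counts = cumulative_counts(amounts)
for le, count in counts.items():
    print(f"le={le}: {count}")
```

An amount of 7000 lands only in the `+Inf` bucket, which is a hint that the top finite bound is too low for the data being recorded.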

Health Checks

monitoring/health.py
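from fastapi import FastAPI, Response
import redis
from sqlalchemy import text

app = FastAPI()
# Assumption: `db` (a SQLAlchemy session or connection) and `redis_client`
# (a redis.Redis instance) are created elsewhere in the application.

@app.get("/health")
def health_check(response: Response):
    # Plain `def` lets FastAPI run these blocking checks in its thread pool
    checks = {}
    # Database
    try:
        db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "error"

    # Redis
    try:
        redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "error"

    all_ok = all(v == "ok" for v in checks.values())
    if not all_ok:
        # Probes usually act on the status code, not the response body
        response.status_code = 503
    return {"status": "healthy" if all_ok else "unhealthy", "checks": checks}

@app.get("/ready")
async def readiness():
    """Readiness probe: can this instance accept traffic?"""
    return {"status": "ready"}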
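
The readiness probe above always reports ready. A common refinement is to run the dependency checks there as well and answer with HTTP 503 when any of them fails, so that a load balancer or orchestrator stops routing traffic to the instance. The following is a framework-independent sketch; `evaluate_checks` and the check callables are hypothetical names, and the Redis failure is simulated.

```python
from typing import Callable, Dict, Tuple

def evaluate_checks(checks: Dict[str, Callable[[], object]]) -> Tuple[int, dict]:
    """Run each check; any exception marks that dependency as failed."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception:
            results[name] = "error"
    ready = all(v == "ok" for v in results.values())
    status_code = 200 if ready else 503
    body = {"status": "ready" if ready else "not ready", "checks": results}
    return status_code, body

def failing_redis_ping():
    raise ConnectionError("redis unreachable")  # simulated outage

code, body = evaluate_checks({"database": lambda: True, "redis": failing_redis_ping})
print(code, body)
```

Keeping the aggregation in a plain function makes it easy to unit test and to reuse from both the health and readiness handlers.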

Prometheus Alerting Rules for Alertmanager

prometheus/alert_rules.yml
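groups:
  - name: python-app
    rules:
      - alert: HighErrorRate
        # Aggregate with sum() first: dividing the raw series pairs each 5xx
        # series with itself, so the ratio would always be 1.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5%"

      - alert: HighLatency
        # sum by (le) gives a service-wide P95 instead of one per method/endpoint
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 2 seconds"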
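
`histogram_quantile` never sees raw latencies. It estimates the quantile by linear interpolation inside the bucket that contains the target rank. Below is a simplified Python sketch of that estimate. It ignores cases PromQL handles separately, such as a quantile that falls into the `+Inf` bucket, and the function name and bucket counts are made up for illustration.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative bucket counts.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    finite bounds only.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Assume observations are spread evenly inside this bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# Hypothetical cumulative counts: 100 requests in total.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 96), (2.5, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
print(f"estimated P95: {p95:.3f}s")
```

Because the result is interpolated, its accuracy depends on bucket layout: an alert threshold such as 2 seconds is estimated best when a bucket boundary sits close to it.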

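Common Interview Questions

Q1: What are the four metric types?

Answer:

Type	Description	Typical use
Counter	Only increases (resets to zero on restart)	Request count, error count
Gauge	Can go up or down	Connection count, temperature
Histogram	Counts observations in buckets	Latency distribution, size distribution
Summary	Quantiles computed on the client	P99 latency

Note that the Python client's Summary exposes only a count and a sum, not quantiles, so percentiles in Python services are normally derived from a Histogram.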

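Q2: What are the three pillars of monitoring?

Answer:

Metrics: numeric measurements (Prometheus + Grafana)
Logging: log records (ELK / Loki)
Tracing: distributed request tracing (Jaeger / Zipkin)

Used together: spot an abnormal metric on a dashboard → follow the trace to locate the faulty service → read the logs to find the root cause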

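Q3: What is the difference between SLI, SLO and SLA?

Answer:

SLI (Service Level Indicator): the measured value, such as availability or P99 latency
SLO (Service Level Objective): the target for an SLI, such as availability > 99.9%
SLA (Service Level Agreement): an external commitment, with compensation if it is violated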
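
An SLO translates directly into an error budget: the amount of unavailability the target still permits. As a worked example, assuming a 30-day window (the window length is an assumption, not stated above), a 99.9% availability target leaves about 43.2 minutes of downtime. The function name is hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min")
```

Each additional nine cuts the budget by a factor of ten, which is why the SLO is usually set slightly stricter than any SLA built on top of it.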
