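Designing a Monitoring and Alerting System
Question
How would you design an application monitoring and alerting system in Python? How do you instrument code with Prometheus metrics?
Answer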
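Architecture
The application exposes its metrics at /metrics. Prometheus scrapes that endpoint and evaluates the alert rules, Alertmanager routes firing alerts to notification channels, and Grafana renders dashboards from the same data.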
Prometheus metric instrumentation
monitoring/metrics.py
from prometheus_client import Counter, Histogram, Gauge, Info

# Request count
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

# Request latency (histogram)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

# Active connections (current value)
ACTIVE_CONNECTIONS = Gauge(
    "active_connections",
    "Number of active connections",
)

# Application info
APP_INFO = Info("app", "Application information")
APP_INFO.info({"version": "1.2.0", "env": "production"})
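The `buckets` argument is easier to reason about once the cumulative semantics are visible. The following is a minimal pure-Python sketch of how observations land in buckets, not the `prometheus_client` implementation; `SketchHistogram` and `BOUNDS` are hypothetical names introduced only for illustration.

```python
from bisect import bisect_left

# Same upper bounds as REQUEST_LATENCY above. +Inf is appended because a
# Prometheus histogram always ends with a catch-all bucket.
BOUNDS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf")]


class SketchHistogram:
    """Hypothetical stand-in for a histogram, for illustration only."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        self.per_bucket = [0] * len(self.bounds)  # non-cumulative counts
        self.total = 0.0  # plays the role of the _sum series

    def observe(self, value):
        # Pick the first bucket whose upper bound is >= value ("le" = less or equal).
        self.per_bucket[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        # The exposed _bucket series are cumulative: each one also counts
        # every observation that fell into a smaller bucket.
        running, out = 0, []
        for bound, count in zip(self.bounds, self.per_bucket):
            running += count
            out.append((bound, running))
        return out
```

Because the counts are cumulative, the `+Inf` bucket always equals the total number of observations, and an observation above the largest finite bound (5.0 here) is counted only there. That is one reason to choose bounds that bracket the latency threshold you intend to alert on.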
FastAPI middleware
monitoring/middleware.py
import time

from fastapi import FastAPI, Request
from prometheus_client import make_asgi_app
from starlette.middleware.base import BaseHTTPMiddleware

from monitoring.metrics import ACTIVE_CONNECTIONS, REQUEST_COUNT, REQUEST_LATENCY


class MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        method = request.method
        # Note: raw paths that embed IDs create one time series per distinct path.
        endpoint = request.url.path
        ACTIVE_CONNECTIONS.inc()
        start = time.perf_counter()
        try:
            response = await call_next(request)
            status = response.status_code
        except Exception:
            # An unhandled exception reaches the client as a 500.
            status = 500
            raise
        finally:
            # Record the request whether it succeeded or raised.
            duration = time.perf_counter() - start
            REQUEST_COUNT.labels(method, endpoint, status).inc()
            REQUEST_LATENCY.labels(method, endpoint).observe(duration)
            ACTIVE_CONNECTIONS.dec()
        return response


# Mount the /metrics endpoint
app = FastAPI()
app.add_middleware(MetricsMiddleware)
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
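The middleware uses `request.url.path` as the `endpoint` label, so a path such as `/users/42` produces a separate time series for every user ID. One possible mitigation, sketched below under the assumption that IDs appear as purely numeric or UUID path segments, is to normalize the path before using it as a label. `normalize_path` and the `{id}` placeholder are hypothetical, not part of FastAPI or `prometheus_client`.

```python
import re

# Assumption: ID-like segments are either all digits or a canonical UUID.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with a fixed placeholder."""
    parts = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            parts.append("{id}")
        else:
            parts.append(segment)
    return "/".join(parts)
```

With this helper, `endpoint = normalize_path(request.url.path)` would keep the label set bounded by the number of route shapes rather than the number of distinct IDs.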
Custom business metrics
monitoring/business.py
from prometheus_client import Counter, Histogram

# Business metrics
ORDER_CREATED = Counter("orders_created_total", "Total orders created", ["product_type"])
PAYMENT_AMOUNT = Histogram(
    "payment_amount_yuan",
    "Payment amount distribution",
    buckets=[10, 50, 100, 500, 1000, 5000],
)


def create_order(order):
    # Business logic...
    ORDER_CREATED.labels(product_type=order.type).inc()
    PAYMENT_AMOUNT.observe(order.amount)
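Health checks
monitoring/health.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import redis
from sqlalchemy import text

# `app`, `db` (a SQLAlchemy session or connection) and `redis_client` are
# assumed to be created elsewhere in the application.


@app.get("/health")
def health_check():
    # Plain def: FastAPI runs it in a threadpool, so the blocking checks
    # below do not stall the event loop.
    checks = {}
    # Database
    try:
        db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "error"
    # Redis
    try:
        redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "error"
    all_ok = all(v == "ok" for v in checks.values())
    # Return 503 when unhealthy so probes and load balancers can act on the status code.
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={"status": "healthy" if all_ok else "unhealthy", "checks": checks},
    )


@app.get("/ready")
async def readiness():
    """Readiness probe: whether the instance can accept traffic."""
    return {"status": "ready"}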
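The dependency checks above are blocking calls with no timeout, and `/ready` always reports ready. A readiness probe could instead run the same checks concurrently with a time limit. The sketch below uses only the standard library and requires Python 3.9+ for `asyncio.to_thread`; `run_checks` and the stand-in check functions are hypothetical and replace the real database and Redis calls.

```python
import asyncio


async def run_checks(checks, timeout=1.0):
    """Run blocking check callables concurrently and map each name to ok/error."""

    async def run_one(name, fn):
        try:
            # to_thread keeps a blocking client call off the event loop.
            await asyncio.wait_for(asyncio.to_thread(fn), timeout)
            return name, "ok"
        except Exception:
            # Covers both a failing check and a timeout.
            return name, "error"

    results = await asyncio.gather(*(run_one(n, f) for n, f in checks.items()))
    return dict(results)


def fake_db_ping():
    """Stand-in for db.execute(text("SELECT 1")); always succeeds."""


def fake_redis_ping():
    """Stand-in for redis_client.ping(); simulates an outage."""
    raise ConnectionError("redis unreachable")
```

Inside an `async def` handler this could be used as `checks = await run_checks({"database": fake_db_ping, "redis": fake_redis_ping})`, returning a non-2xx status when any value is `"error"`. A check that times out is reported as failed, but its worker thread still runs to completion in the background.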
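Prometheus alerting rules (delivered through Alertmanager)
prometheus/alert_rules.yml
groups:
  - name: python-app
    rules:
      - alert: HighErrorRate
        # sum() on both sides: without it the division matches series label by label,
        # so each 5xx series is divided by itself and the ratio is always 1.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5%"
      - alert: HighLatency
        # Evaluated per method/endpoint series; wrap the rate in sum by (le) (...)
        # for a single service-wide P95.
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 2 seconds"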
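The HighLatency rule relies on `histogram_quantile`, which estimates a quantile from bucket counts instead of from raw observations. The sketch below reproduces the basic idea for classic histograms (linear interpolation inside the bucket that contains the target rank) in plain Python. It is an approximation for illustration, not Prometheus's code, and `EXAMPLE_BUCKETS` is made-up data.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    Buckets must be sorted by upper bound and end with +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # The rank falls past the last finite bound, so that bound is returned.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts for 100 requests.
EXAMPLE_BUCKETS = [(0.5, 50), (1.0, 80), (2.5, 95), (5.0, 99), (float("inf"), 100)]
```

With this data, `histogram_quantile(0.9, EXAMPLE_BUCKETS)` lands two thirds of the way through the 1.0 to 2.5 bucket, i.e. about 2.0 seconds. The estimate can never exceed the largest finite bound, so a `> 2` threshold is only meaningful when the bucket layout has bounds above 2.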
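Common interview questions
Q1: What are the four metric types?
Answer:
| Type | Description | Typical use |
|---|---|---|
| Counter | Only increases | Request count, error count |
| Gauge | Can go up and down | Connection count, temperature |
| Histogram | Observations counted in buckets | Latency distribution, size distribution |
| Summary | Quantiles computed on the client | P99 latency |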
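Q2: What are the three pillars of monitoring?
Answer:
- Metrics: numeric measurements (Prometheus + Grafana)
- Logging: logs (ELK / Loki)
- Tracing: distributed tracing (Jaeger / Zipkin)
Used together: spot the abnormal metric on a dashboard → follow the trace to locate the failing service → read the logs to find the root cause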
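Moving from a trace to the matching logs only works if log lines carry the trace ID. The sketch below shows one way to attach it using only the standard library. `trace_id_var`, `TraceIdFilter`, and `build_logger` are hypothetical names, and in a real system the ID would come from the tracing library rather than being set by hand.

```python
import contextvars
import logging

# Hypothetical context variable holding the current request's trace ID.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True  # never drop the record, only annotate it


def build_logger(stream=None):
    """Return a logger whose lines include trace_id=<id>."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("app.sketch")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A middleware would call `trace_id_var.set(...)` at the start of each request. Because `ContextVar` values are isolated per asyncio task, concurrent requests do not overwrite each other's ID.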
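Q3: What is the difference between SLI, SLO, and SLA?
Answer:
- SLI (Service Level Indicator): the measured quantity, such as availability or P99 latency
- SLO (Service Level Objective): the target value for an SLI, such as availability > 99.9%
- SLA (Service Level Agreement): an external commitment to customers; violations carry compensation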
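An SLO becomes actionable once it is converted into an error budget. The arithmetic can be sketched as follows; the 30-day window and both function names are assumptions made for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability a window can absorb while still meeting the SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo)


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent, from request counts (the SLI)."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = total * (1 - slo)
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad
```

For a 99.9% availability SLO over 30 days, the budget is 43,200 minutes × 0.001 = 43.2 minutes. If 500 of 1,000,000 requests failed, half of the 1,000-request budget is spent and `budget_remaining` returns about 0.5. A negative result means the SLO has been missed.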