
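Designing a Monitoring and Alerting System

Question

How would you design a complete monitoring and alerting system in Go that covers metric collection, storage, visualization, and alerting?

Answer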

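Overall Architecture

Metric Types (the four Prometheus types)

| Type | Description | Examples |
| --- | --- | --- |
| Counter | Count that only ever increases (it resets to zero on process restart) | Total requests, error count |
| Gauge | Value that can go up or down | Current connections, goroutine count |
| Histogram | Observations counted into buckets to capture a distribution | Request latency distribution |
| Summary | Quantiles computed on the client | P50/P99 latency |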

Instrumentation

import (
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: total number of HTTP requests.
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"method", "path", "status"},
	)

	// Histogram: request latency.
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds.",
			Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
		},
		[]string{"method", "path"},
	)

	// Gauge: currently active connections.
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of currently active connections.",
		},
	)

	// Gauge: goroutine count, read lazily on every scrape.
	// The default registry already exports go_goroutines; this gauge
	// mainly shows how GaugeFunc works.
	goroutineCount = promauto.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "goroutine_count",
			Help: "Current number of goroutines.",
		},
		func() float64 { return float64(runtime.NumGoroutine()) },
	)
)

Automatic Instrumentation with Gin Middleware

// Additional imports needed by this snippet:
// "strconv", "time", "github.com/gin-gonic/gin"

func PrometheusMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()
		// Use the route template rather than the raw path to avoid
		// high label cardinality. It is empty for unmatched routes.
		path := c.FullPath()
		if path == "" {
			path = "unknown"
		}

		c.Next()

		duration := time.Since(start).Seconds()
		status := strconv.Itoa(c.Writer.Status())

		httpRequestsTotal.WithLabelValues(c.Request.Method, path, status).Inc()
		httpRequestDuration.WithLabelValues(c.Request.Method, path).Observe(duration)
	}
}

func main() {
	r := gin.New()
	r.Use(PrometheusMiddleware())

	// Expose the /metrics endpoint for Prometheus to scrape.
	r.GET("/metrics", gin.WrapH(promhttp.Handler()))

	r.GET("/api/users", getUsers) // getUsers is defined elsewhere
	r.Run(":8080")
}
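
Label cardinality pitfall

Never use high-cardinality values such as userID or requestID as Prometheus labels. Every distinct combination of label values becomes its own time series, so such labels cause a series explosion and a sharp rise in memory use. Restrict labels to low-cardinality values (method, status, service, and so on).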
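
To make the series explosion concrete, the sketch below multiplies the number of distinct values per label to get an upper bound on the series one metric can produce. The function name `maxSeries` and the example cardinalities are illustrative assumptions, not measurements.

```go
package main

import "fmt"

// maxSeries returns the upper bound on the number of time series for one
// metric: the product of the distinct-value counts of its labels.
func maxSeries(labelCardinalities ...int) int {
	total := 1
	for _, n := range labelCardinalities {
		total *= n
	}
	return total
}

func main() {
	// Assumed cardinalities: 5 methods, 50 route templates, 10 status codes.
	fmt.Println(maxSeries(5, 50, 10)) // 2500

	// Adding a userID label with 100,000 distinct users multiplies the bound.
	fmt.Println(maxSeries(5, 50, 10, 100000)) // 250000000
}
```

The real series count is usually lower than this bound because not every combination occurs, but an unbounded label such as userID removes any ceiling.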

Custom Business Metrics

// Business metric: orders created.
var orderCreated = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "order_created_total",
		Help: "Number of orders created.",
	},
	[]string{"channel"}, // app, web, api
)

// Business metric: payment latency.
var paymentDuration = promauto.NewHistogram(
	prometheus.HistogramOpts{
		Name:    "payment_duration_seconds",
		Help:    "Time spent processing a payment, in seconds.",
		Buckets: prometheus.DefBuckets,
	},
)

func CreateOrder(ctx context.Context, req OrderReq) error {
	orderCreated.WithLabelValues(req.Channel).Inc()

	// Time the payment call, whether it succeeds or fails.
	start := time.Now()
	err := processPayment(ctx, req)
	paymentDuration.Observe(time.Since(start).Seconds())

	return err
}
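
CreateOrder above times the payment call inline. As a function grows more return paths, a defer-based helper keeps the timing in one place. The sketch below uses only the standard library; `startTimer` is a hypothetical helper, and its `observe` callback stands in for a method such as `paymentDuration.Observe`.

```go
package main

import (
	"fmt"
	"time"
)

// startTimer records the start time and returns a stop function that
// reports the elapsed seconds to observe. Hypothetical helper.
func startTimer(observe func(seconds float64)) (stop func()) {
	start := time.Now()
	return func() {
		observe(time.Since(start).Seconds())
	}
}

func processPayment() error {
	// With a real histogram the callback would be paymentDuration.Observe.
	defer startTimer(func(s float64) {
		fmt.Printf("payment took %.3fs\n", s)
	})()

	time.Sleep(20 * time.Millisecond) // simulated work
	return nil
}

func main() {
	if err := processPayment(); err != nil {
		fmt.Println("payment failed:", err)
	}
}
```

client_golang ships a comparable helper, `prometheus.NewTimer`, which may be preferable to a hand-rolled one in real code.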

SLI / SLO Definitions

# Recording rules; place them under the rules: list of a rule group.

# Availability SLO: 99.9%
# SLI = successful requests / total requests
- record: sli:availability
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

# Latency SLO: P99 < 500ms
- record: sli:latency_p99
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )
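
An SLO target translates directly into an error budget. The following sketch works through the arithmetic for the 99.9% availability target above. The 30-day window, the 1% observed error ratio, and the function names are assumptions chosen for illustration.

```go
package main

import "fmt"

// errorBudgetMinutes returns how many minutes of full unavailability an
// SLO target (for example 0.999) allows over a window of windowDays days.
func errorBudgetMinutes(slo, windowDays float64) float64 {
	return (1 - slo) * windowDays * 24 * 60
}

// burnRate compares the observed error ratio with the ratio the SLO allows.
// A value of 1 uses up the budget exactly over the window; above 1 is faster.
func burnRate(errorRatio, slo float64) float64 {
	return errorRatio / (1 - slo)
}

func main() {
	// 0.1% of 30 days is about 43.2 minutes.
	fmt.Printf("budget: %.1f min\n", errorBudgetMinutes(0.999, 30))

	// A 1% error ratio burns a 0.1% budget about 10 times too fast.
	fmt.Printf("burn rate: %.1f\n", burnRate(0.01, 0.999))
}
```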

Alerting Rules

# Prometheus alerting rules
groups:
  - name: service-alerts
    rules:
      # Error rate above 1% for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Abnormal goroutine count
      - alert: GoroutineLeak
        expr: goroutine_count > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Possible goroutine leak"

      # P99 latency too high
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
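
The `for:` field in the rules above means the condition must hold continuously before the alert fires. Until then the alert is pending, and a single false evaluation resets it. The sketch below models that state machine in plain Go. It is a simplified illustration with hypothetical names, not Prometheus's actual implementation.

```go
package main

import "fmt"

// alertState tracks one alert rule with a hold duration.
type alertState struct {
	holdSeconds int64 // equivalent of the rule's "for:" value
	active      bool  // whether the condition held on the last evaluation
	activeSince int64 // Unix seconds when the condition first became true
}

// eval is called on every rule evaluation and returns the resulting state.
func (a *alertState) eval(nowUnix int64, conditionTrue bool) string {
	if !conditionTrue {
		a.active = false // any false evaluation resets the hold timer
		return "inactive"
	}
	if !a.active {
		a.active = true
		a.activeSince = nowUnix
	}
	if nowUnix-a.activeSince >= a.holdSeconds {
		return "firing"
	}
	return "pending"
}

func main() {
	rule := &alertState{holdSeconds: 300} // for: 5m
	// The condition drops briefly at t=120, which restarts the hold period.
	samples := []struct {
		t    int64
		cond bool
	}{{0, true}, {60, true}, {120, false}, {180, true}, {480, true}}
	for _, s := range samples {
		fmt.Println(s.t, rule.eval(s.t, s.cond))
	}
}
```

This is why a short `for:` value makes alerts noisy on spiky metrics, while a long one delays detection of real incidents.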

Dispatching Alert Notifications in Go

// Receiver for the Alertmanager webhook.
// Imports needed: "fmt", "time", "github.com/gin-gonic/gin".
type AlertWebhook struct {
	Status string  `json:"status"` // firing / resolved
	Alerts []Alert `json:"alerts"`
}

type Alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
}

func HandleAlert(c *gin.Context) {
	var webhook AlertWebhook
	if err := c.ShouldBindJSON(&webhook); err != nil {
		c.JSON(400, gin.H{"error": err.Error()})
		return
	}

	for _, alert := range webhook.Alerts {
		msg := fmt.Sprintf("[%s] %s\n%s",
			alert.Labels["severity"],
			alert.Annotations["summary"],
			alert.Annotations["description"],
		)

		// Choose the notification channel by severity.
		// sendDingTalk and sendSMS are assumed to be defined elsewhere.
		switch alert.Labels["severity"] {
		case "critical":
			sendDingTalk(msg) // DingTalk + SMS
			sendSMS(msg)
		case "warning":
			sendDingTalk(msg) // DingTalk only
		}
	}
	c.JSON(200, gin.H{"status": "ok"})
}
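
HandleAlert above notifies for every alert in the payload. Alertmanager also sets a status on each individual alert, so a receiver can skip the ones that have already resolved. The sketch below decodes a trimmed payload with the standard library and formats only the firing alerts. The sample values and the names `webhookAlert`, `webhookPayload`, and `firingMessages` are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type webhookAlert struct {
	Status      string            `json:"status"` // per alert: firing / resolved
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

type webhookPayload struct {
	Status string         `json:"status"`
	Alerts []webhookAlert `json:"alerts"`
}

// Trimmed sample payload; the values are invented for this sketch.
const samplePayload = `{
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "HighErrorRate", "severity": "critical"},
     "annotations": {"summary": "High error rate"}},
    {"status": "resolved",
     "labels": {"alertname": "HighLatency", "severity": "warning"},
     "annotations": {"summary": "High latency"}}
  ]
}`

// firingMessages decodes a payload and formats one line per firing alert.
func firingMessages(payload []byte) ([]string, error) {
	var p webhookPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return nil, fmt.Errorf("decode webhook: %w", err)
	}
	var msgs []string
	for _, a := range p.Alerts {
		if a.Status != "firing" {
			continue // skip resolved alerts in this sketch
		}
		msgs = append(msgs, fmt.Sprintf("[%s] %s",
			a.Labels["severity"], a.Annotations["summary"]))
	}
	return msgs, nil
}

func main() {
	msgs, err := firingMessages([]byte(samplePayload))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range msgs {
		fmt.Println(m)
	}
}
```

Whether resolved alerts should be dropped or sent as a separate "recovered" message is a product decision; the filter here only shows where that branch would go.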

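Key Monitoring Dimensions

| Dimension | Metrics |
| --- | --- |
| RED method | Rate (request rate), Errors (error rate), Duration (latency) |
| USE method | Utilization, Saturation, Errors |
| Runtime | Goroutine count, GC pauses, memory, CPU |
| Infrastructure | Disk, network, connection count |
| Business | Order volume, payment success rate, conversion rate |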

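Common Interview Questions

Q1: How do you choose between the Prometheus pull and push models?

Answer

  • Pull (the Prometheus default): Prometheus actively scrapes each target's /metrics endpoint. It suits long-running services
  • Push: the service pushes its metrics to a Pushgateway. It suits batch jobs and other short-lived jobs that may exit before a scrape happens
  • For Go microservices, prefer the pull model combined with service discovery so that targets are found automatically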
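
In the pull model a target only has to serve its current values as text over HTTP, and the scraper does the rest. The sketch below shows this with the standard library alone: a handler renders one counter in the Prometheus text exposition format, and a plain HTTP GET stands in for a scrape. The metric name `jobs_processed_total` is an assumption; real services would normally use `promhttp.Handler()` instead of formatting text by hand.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

var jobsProcessed uint64

// renderMetrics formats one counter in the Prometheus text exposition format.
func renderMetrics(count uint64) string {
	return fmt.Sprintf("# HELP jobs_processed_total Total jobs processed.\n"+
		"# TYPE jobs_processed_total counter\n"+
		"jobs_processed_total %d\n", count)
}

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics(atomic.LoadUint64(&jobsProcessed)))
}

func main() {
	atomic.AddUint64(&jobsProcessed, 3)

	// A local test server plays the role of the service being scraped.
	srv := httptest.NewServer(http.HandlerFunc(metricsHandler))
	defer srv.Close()

	// Stand-in for a Prometheus scrape: a plain HTTP GET.
	resp, err := http.Get(srv.URL)
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Print(string(body))
}
```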

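Q2: How do you choose between Histogram and Summary?

Answer

  • Histogram: the client only counts observations into fixed buckets, and quantiles are computed on the Prometheus server with histogram_quantile(), so buckets can be summed across instances first. Bucket boundaries are fixed in advance, and accuracy depends on how well they fit the data
  • Summary: the client computes the quantiles itself, and those precomputed quantiles cannot be aggregated across instances
  • Prefer Histogram, because it can be aggregated in multi-instance deployments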
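
The aggregation argument can be checked with a little arithmetic. The sketch below sums cumulative bucket counts from two instances and then estimates P90 by linear interpolation inside the bucket that contains the target rank, which is the basic idea behind histogram_quantile(). The bucket values are invented, the type and function names are hypothetical, and the special handling of the +Inf bucket is left out.

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one cumulative histogram bucket: the number of
// observations less than or equal to upperBound (the "le" label).
type bucket struct {
	upperBound float64
	count      float64
}

// mergeBuckets sums per-instance buckets. It assumes both slices use the
// same sorted boundaries and have the same length.
func mergeBuckets(a, b []bucket) []bucket {
	out := make([]bucket, len(a))
	for i := range a {
		out[i] = bucket{upperBound: a[i].upperBound, count: a[i].count + b[i].count}
	}
	return out
}

// estimateQuantile interpolates linearly inside the first bucket whose
// cumulative count reaches the requested rank.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 || q <= 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	lower, prev := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return lower + (b.upperBound-lower)*(rank-prev)/(b.count-prev)
		}
		lower, prev = b.upperBound, b.count
	}
	return lower
}

func main() {
	instanceA := []bucket{{0.1, 50}, {0.5, 90}, {1, 100}}
	instanceB := []bucket{{0.1, 10}, {0.5, 60}, {1, 100}}

	merged := mergeBuckets(instanceA, instanceB)
	fmt.Printf("merged P90: %.2f\n", estimateQuantile(0.9, merged))

	// Averaging per-instance quantiles gives a different, wrong answer.
	avg := (estimateQuantile(0.9, instanceA) + estimateQuantile(0.9, instanceB)) / 2
	fmt.Printf("average of per-instance P90s: %.2f\n", avg)
}
```

With these numbers the merged estimate is about 0.80s while the average of the two per-instance P90s is about 0.69s, which illustrates why client-side Summary quantiles should not be averaged across instances.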

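Q3: How do you prevent an "alert storm"?

Answer

  • Grouping: group_by in the Alertmanager route merges alerts that share the same label values, so each group produces a single notification
  • Inhibition: inhibit_rules let a higher-severity alert suppress related lower-severity alerts
  • Silences: mute alerts during maintenance windows
  • Severity tiers: critical triggers a phone call, warning goes to DingTalk, info is only logged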
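
The grouping idea from the list above can be sketched in a few lines: build a key from the chosen labels and collect alerts under it, so that one notification is sent per key rather than per alert. This is a simplified illustration with hypothetical types and does not reproduce Alertmanager's timing parameters or internals.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type alert struct {
	labels map[string]string
}

// groupKey joins the values of the groupBy labels into a stable key.
func groupKey(a alert, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+a.labels[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts collects alerts by key so one notification can cover a group.
func groupAlerts(alerts []alert, groupBy []string) map[string][]alert {
	groups := make(map[string][]alert)
	for _, a := range alerts {
		key := groupKey(a, groupBy)
		groups[key] = append(groups[key], a)
	}
	return groups
}

func main() {
	alerts := []alert{
		{labels: map[string]string{"alertname": "HighErrorRate", "instance": "a"}},
		{labels: map[string]string{"alertname": "HighErrorRate", "instance": "b"}},
		{labels: map[string]string{"alertname": "HighLatency", "instance": "a"}},
	}
	groups := groupAlerts(alerts, []string{"alertname"})

	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d alert(s) in one notification\n", k, len(groups[k]))
	}
}
```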

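Q4: Which Go runtime metrics should be monitored?

Answer

  • Goroutine count (to detect leaks)
  • GC pause time and frequency
  • Heap memory in use / number of heap objects
  • Thread count

promhttp.Handler() serves the default registry, which exposes runtime metrics with the go_ prefix out of the box.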
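
For a sense of where such numbers come from, the sketch below reads the same kinds of values directly from the standard `runtime` package. The `runtimeSnapshot` struct and `takeSnapshot` function are hypothetical; in a real service the go_ metrics make this unnecessary.

```go
package main

import (
	"fmt"
	"runtime"
)

// runtimeSnapshot holds a few runtime values worth watching.
type runtimeSnapshot struct {
	Goroutines   int
	Threads      int    // OS threads created so far
	HeapAlloc    uint64 // bytes of allocated heap objects
	HeapObjects  uint64
	NumGC        uint32 // completed GC cycles
	PauseTotalNs uint64 // cumulative GC stop-the-world pause time
}

func takeSnapshot() runtimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // stops the world briefly; avoid in hot paths

	threads, _ := runtime.ThreadCreateProfile(nil)

	return runtimeSnapshot{
		Goroutines:   runtime.NumGoroutine(),
		Threads:      threads,
		HeapAlloc:    m.HeapAlloc,
		HeapObjects:  m.HeapObjects,
		NumGC:        m.NumGC,
		PauseTotalNs: m.PauseTotalNs,
	}
}

func main() {
	s := takeSnapshot()
	fmt.Printf("goroutines=%d threads=%d heap=%dB objects=%d gc=%d pause=%dns\n",
		s.Goroutines, s.Threads, s.HeapAlloc, s.HeapObjects, s.NumGC, s.PauseTotalNs)
}
```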

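Q5: How do you achieve second-level monitoring?

Answer

  • In practice, Prometheus scrape intervals are rarely set below 10 to 15s; shorter intervals can be configured but increase scrape and storage load
  • For second-level resolution, aggregate inside the application and push to a time-series database (InfluxDB / VictoriaMetrics)
  • Alternatively, use a commercial solution such as the Datadog Agent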
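
The in-application aggregation option can be sketched as follows: count events per Unix second in memory, then periodically flush the completed seconds to the database. Everything here is a hypothetical illustration. In particular, the write to the time-series database is replaced by printing, since the actual client API depends on the database chosen.

```go
package main

import (
	"fmt"
	"sync"
)

// secondAggregator counts events per Unix second in memory.
type secondAggregator struct {
	mu     sync.Mutex
	counts map[int64]int
}

func newSecondAggregator() *secondAggregator {
	return &secondAggregator{counts: make(map[int64]int)}
}

// record adds one event at the given Unix timestamp (in seconds).
func (a *secondAggregator) record(unixSec int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.counts[unixSec]++
}

// flush removes and returns all buckets strictly older than nowSec, so the
// still-open current second is not emitted half-filled.
func (a *secondAggregator) flush(nowSec int64) map[int64]int {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := make(map[int64]int)
	for sec, n := range a.counts {
		if sec < nowSec {
			out[sec] = n
			delete(a.counts, sec)
		}
	}
	return out
}

func main() {
	agg := newSecondAggregator()
	for _, ts := range []int64{100, 100, 101, 102} {
		agg.record(ts)
	}

	// In a real service a ticker would call flush every second and write the
	// result to the time-series database; printing stands in for that push.
	for sec, n := range agg.flush(102) {
		fmt.Printf("second=%d requests=%d\n", sec, n)
	}
}
```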
