Prometheus Metrics Collection

Vibe Prompt

"Write me a Prometheus configuration file that monitors the nodes, Pods, and Deployments of a K8s cluster, scraping every 15 seconds."

prometheus.yml

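global:
  scrape_interval: 15s      # scrape every target every 15 seconds
  evaluation_interval: 15s  # evaluate recording and alerting rules every 15 seconds

scrape_configs:
  # API servers. Assumes Prometheus runs inside the cluster and uses the
  # default service-account mount for TLS and authentication.
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Without this filter, role: endpoints returns every endpoint in the cluster.
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubelet metrics, one target per node. The kubelet port serves HTTPS,
  # so the same in-cluster credentials are assumed here.
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Copy Kubernetes node labels onto the scraped series.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Pods that opt in with the annotation prometheus.io/scrape: "true".
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  # node_exporter for host-level CPU, memory, and disk metrics.
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']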
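
The kubernetes-pods job keeps only Pods whose prometheus.io/scrape annotation is true, so each workload has to opt in. A minimal sketch of such a Pod follows. The Pod name, image, and port are hypothetical placeholders, and the application is assumed to serve metrics at /metrics on the declared container port.

```yaml
# Hypothetical Pod manifest: only the annotation matters to the scrape job.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                        # hypothetical name
  annotations:
    prometheus.io/scrape: "true"        # annotation values are strings, so keep the quotes
spec:
  containers:
    - name: demo-app
      image: example.com/demo-app:1.0   # hypothetical image
      ports:
        - containerPort: 8080           # assumed port that serves /metrics
```

With pod discovery, Prometheus creates one target per declared container port, so declaring only the metrics port avoids scraping ports that do not serve metrics.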

Key Metrics

# CPU utilization (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Pod restart rate (restarts per second over 5m; use increase() for a count)
rate(kube_pod_container_status_restarts_total[5m])

# Disk space utilization (%)
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
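
The Pod restart query relies on kube_pod_container_status_restarts_total, which comes from kube-state-metrics rather than node_exporter. The same component exposes Deployment state such as kube_deployment_status_replicas_available, which the original prompt asks for. A minimal sketch of the extra scrape job follows, assuming kube-state-metrics is already deployed and reachable through a Service named kube-state-metrics in the kube-system namespace on port 8080. All three are assumptions about the cluster, not part of the configuration above.

```yaml
# Addition to scrape_configs in prometheus.yml (sketch).
scrape_configs:
  - job_name: 'kube-state-metrics'
    static_configs:
      # Assumed Service DNS name and port; adjust to the actual deployment.
      - targets: ['kube-state-metrics.kube-system.svc:8080']
```

Once this job is scraped, a query such as kube_deployment_spec_replicas - kube_deployment_status_replicas_available shows how many replicas each Deployment is missing.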

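Key Takeaways

✅ Prometheus uses a pull model: it actively scrapes its targets instead of waiting for them to push metrics
✅ Four metric types: Counter (cumulative), Gauge (goes up and down), Histogram (distribution), Summary (quantiles)
✅ PromQL is the query language; it supports aggregation, arithmetic, and time-range selection
✅ rate() converts a Counter into a per-second rate and is the most commonly used function
✅ As a rule of thumb, node_exporter plus the four golden signals cover the large majority of monitoring needs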

Four Golden Signals

| Signal | PromQL example | Meaning |
|------|------------|------|
| Latency | histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) | Time within which 99% of requests complete (seconds) |
| Traffic | sum(rate(http_requests_total[5m])) | Requests per second |
| Errors | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | Fraction of requests that return a 5xx status |
| Saturation | 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes | Fraction of memory in use |

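Useful Recording Rules

groups:
  - name: default
    rules:
      # Per-job request rate, precomputed so dashboards avoid re-aggregating raw series.
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # Per-instance CPU utilization as a 0-1 ratio (1 minus the idle fraction).
      - record: instance:node_cpu_utilization:avg5m
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)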
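
Recording rules only take effect once Prometheus loads the file that contains them. A minimal sketch follows, assuming the rules above are saved as rules/recording.yml next to prometheus.yml. The path is a hypothetical choice.

```yaml
# prometheus.yml (excerpt)
global:
  evaluation_interval: 15s   # how often the loaded rules are evaluated

rule_files:
  - 'rules/recording.yml'    # hypothetical path, relative to prometheus.yml
```

Running promtool check rules rules/recording.yml validates the file before a reload. Dashboards can then query job:http_requests_total:rate5m directly instead of recomputing the sum.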


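Prometheus: The Heart of Monitoring

Prometheus is an open-source monitoring system and time series database. It periodically scrapes metrics from target services, stores them, and makes them available for querying.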

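Core Concepts

| Concept | Description |
|:----|:----|
| Metric | A metric name plus labels |
| Target | A monitored service (exposes a /metrics endpoint) |
| Alert | A rule-based alert (for example, CPU > 80%) |
| Exporters | Convert the metrics of third-party systems into the Prometheus format |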
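
The Alert row can be made concrete with an alerting rule. The following is a minimal sketch of the CPU > 80% example, reusing the CPU utilization expression from the key metrics. The group name, alert name, 10-minute hold time, and severity label are illustrative assumptions.

```yaml
groups:
  - name: node-alerts                # hypothetical group name
    rules:
      - alert: HighCpuUsage          # hypothetical alert name
        # True when average non-idle CPU per instance exceeds 80%.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m                     # condition must hold this long before the alert fires
        labels:
          severity: warning          # assumed label, typically used for routing
        annotations:
          summary: 'CPU usage above 80% on {{ $labels.instance }}'
```

Prometheus only evaluates the rule and marks the alert as firing. Delivering notifications requires Alertmanager, which is outside the scope of this sketch.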

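Common Metric Types

| Type | Description | Examples |
|:----|:----|:----|
| Counter | A counter that only increases | Total requests, total errors |
| Gauge | A value that can go up or down | CPU utilization, connection count |
| Histogram | Distribution statistics | Request latency distribution |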

Next Chapter Preview: Grafana

Prometheus collects the data; Grafana turns it into charts.