Prometheus 监控系统
概述
Prometheus 是一个开源的系统监控和报警工具包,最初由 SoundCloud 开发,现由 Cloud Native Computing Foundation(CNCF) 维护。它通过拉取(Pull)模型和时间序列数据库(TSDB)来收集和存储指标数据。
核心组件
1. Prometheus Server
- Retrieval:从各个目标收集指标
- Storage:存储时间序列数据
- Query Language:PromQL 查询语言
- Rule evaluation:评估警报规则
2. Exporters
将现有程序包装成支持 Prometheus 指标的格式,如:
- node_exporter:服务器指标
- mysqld_exporter:MySQL 指标
- redis_exporter:Redis 指标
3. Alertmanager
处理警报的分组、去重、静默和抑制,支持多种通知渠道。
4. Pushgateway
允许临时任务推送其指标,这些任务通常不会持续暴露给 Prometheus 拉取。
安装和配置
1. 安装 Prometheus
# 创建 prometheus 用户
sudo useradd --no-create-home --shell /bin/false prometheus
# 创建必要目录
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
# 设置所有权
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
# 下载并解压 Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar -xvf prometheus-2.40.0.linux-amd64.tar.gz
# 进入解压后的目录
cd prometheus-2.40.0.linux-amd64
# 移动二进制文件
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
# 设置所有者为 prometheus 用户
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
# 复制配置文件和配置目录
sudo cp -r consoles /etc/prometheus
sudo cp -r console_libraries /etc/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus/consoles
sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries
2. Prometheus 配置文件
创建主配置文件 /etc/prometheus/prometheus.yml:
global:
scrape_interval: 15s # 设置全局默认抓取间隔
evaluation_interval: 15s # 规则评估间隔
# 规则文件路径
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
# 目标列表
scrape_configs:
# 监控 prometheus 本身
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# 监控节点
- job_name: "node"
static_configs:
- targets: ["localhost:9100"]
# 监控服务
- job_name: "service"
static_configs:
- targets: ["localhost:8080", "localhost:8081"]
# 动态服务发现
- job_name: "services-kubernetes"
kubernetes_sd_configs:
- api_server: null
role: pod
kubeconfig_file: ""
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels:
[__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
alerting:
alertmanagers:
- static_configs:
- targets:
# - "alertmanager:9093"
3. 创建 Systemd 服务
创建文件 /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
StartLimitIntervalSec=500
StartLimitBurst=5
[Service]
Type=simple
Restart=on-failure
RestartSec=5s
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090 \
--storage.tsdb.retention.time=200h \
--web.enable-lifecycle
[Install]
WantedBy=multi-user.target
4. 启动 Prometheus
# 重新加载 systemd 配置
sudo systemctl daemon-reload
# 启动 Prometheus 服务
sudo systemctl start prometheus
# 验证 Prometheus 是否正在运行
sudo systemctl status prometheus
# 设置 Prometheus 在启动时自动运行
sudo systemctl enable prometheus
常用 Exporters
Node Exporter(服务器指标)
# 创建用户
sudo useradd --no-create-home --shell /bin/false node_exporter
# 下载 node_exporter
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.40.0.linux-amd64.tar.gz
# 复制二进制文件
sudo cp node_exporter-1.4.0.linux-amd64/node_exporter /usr/local/bin
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
# 创建 systemd 服务文件
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
启动服务:
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
告警规则配置
示例告警规则(rules.yml)
groups:
- name: example
rules:
# 指标超过阈值
- alert: HighCpuUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 2m
labels:
severity: warning
annotations:
summary: "High CPU usage detected on {{ $labels.instance }}"
description: "{{ $labels.instance }} has had high CPU usage (> 80%) for the last 2 minutes."
# 服务宕机
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.job }} job {{ $labels.instance }} has been down for more than 1 minute."
# 内存使用率过高
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 2m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "{{ $labels.instance }} has had high memory usage (> 90%) for the last 2 minutes."
Prometheus 查询语言(PromQL)
基本查询示例
# 瞬时向量选择器
up # 检查所有被监控的目标的up状态
# 区间向量选择器
rate(http_requests_total[5m]) # 计算过去5分钟内的请求速率
# 聚合操作
sum(cpu_usage_percent) # 所有CPU使用百分比的总和
avg(cpu_usage_percent) # 所有CPU使用百分比的平均值
max_over_time(up[1h]) # 过去1小时内UP的最大值
最佳实践
配置最佳实践
- 合理设置抓取间隔以减少服务器压力
- 正确设置标签来标识实例和作业
- 对于短暂任务使用 Pushgateway
- 适当地配置规则和告警
- 使用 relabeling 功能过滤或转换标签
性能优化建议
- 合理配置保留策略避免存储过度增长
- 使用高效的标签模式避免标签爆炸
- 在告警规则中使用
for子句减少抖动 - 谨慎设置聚合函数的使用,特别是在高基数数据上
监控可视化
Prometheus 通常与其他工具配合使用进行可视化,如 Grafana,可以用来展示收集的指标。Prometheus 还自带一个简单的内置表达式浏览器,可以通过 http://YOUR_SERVER_IP:9090/graph 访问。