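Preface: this article shows how to use docker-compose to deploy Prometheus for monitoring containers, hosts, and services, and how to build a monitoring dashboard with Grafana. Because Grafana's built-in alerting is not very friendly, alerts are sent to DingTalk through the dingtalk webhook component together with Alertmanager.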
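I. Writing the docker-compose file (straight to the point)

Setting up Docker itself is not covered here; there are plenty of guides online. First, the components to deploy:

- prometheus: the core monitoring component
- cadvisor: collects metrics from Docker containers
- node-exporter: collects metrics from the host server
- grafana: an easy-to-use visualization component for monitoring charts
- alertmanager: the alerting component
- dingtalk: Alertmanager does not support DingTalk natively, so the dingtalk webhook plugin is needed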
First create a prometheus directory to hold the docker-compose file and the configuration files that the containers mount. Inside prometheus, create two directories:

- prome: holds the Prometheus configuration files
- alert: holds the alerting configuration files

Here is the docker-compose.yml file:

version: '2'

networks:
  monitor:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    hostname: prometheus
    restart: always
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.time=30d'
    volumes:
      - ./prome:/etc/prometheus
    ports:
      - "29011:9090"
    networks:
      - monitor

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      # absolute path on the author's host; adjust it to where your prometheus directory lives
      - /home/docker/prometheus/alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "29012:9093"
    environment:
      - TZ=Asia/Shanghai
    networks:
      - monitor

  grafana:
    image: grafana/grafana
    container_name: grafana
    hostname: grafana
    restart: always
    ports:
      - "29013:3000"
    networks:
      - monitor

  node-exporter:
    image: quay.io/prometheus/node-exporter
    container_name: node-exporter
    hostname: node-exporter
    restart: always
    ports:
      - "29014:9100"
    networks:
      - monitor

  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      # host Docker data directory (here /home/docker/; Docker's default is /var/lib/docker)
      - /home/docker/:/var/lib/docker:ro
    ports:
      - "29015:8080"
    networks:
      - monitor

  dingtalk:
    image: timonwong/prometheus-webhook-dingtalk
    container_name: dingtalk
    hostname: dingtalk
    restart: always
    volumes:
      - ./alert/config.yml:/etc/prometheus-webhook-dingtalk/config.yml
      - ./alert/dingtalk.tmpl:/opt/dingtalk/template/dingtalk.tmpl
    ports:
      - "29016:8060"
    environment:
      - TZ=Asia/Shanghai
    networks:
      - monitor
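The directory layout described above can be sketched as a small Python helper. It is only an illustration: the function name create_skeleton and the DIRS/FILES lists are hypothetical, derived from the volume mounts in docker-compose.yml. Creating the files up front matters because Docker typically creates a directory, not a file, when the host side of a bind mount does not exist yet.

```python
from pathlib import Path

# Directories and files referenced by the volume mounts in docker-compose.yml.
# These lists are an assumption derived from the compose file in this article.
DIRS = ["prome/rules", "alert"]
FILES = [
    "docker-compose.yml",
    "prome/prometheus.yml",
    "alert/alertmanager.yml",
    "alert/config.yml",
    "alert/dingtalk.tmpl",
]


def create_skeleton(root="prometheus"):
    """Create the directory tree and empty placeholder files under root."""
    root_path = Path(root)
    for rel in DIRS:
        (root_path / rel).mkdir(parents=True, exist_ok=True)
    created = []
    for rel in FILES:
        path = root_path / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # existing content is left untouched
        created.append(path)
    return created
```

Calling create_skeleton() in the parent directory produces empty placeholders; each file still has to be filled with the contents given in this article before the stack is started.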
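II. Prometheus configuration files

Note: the file paths and names below are ones I picked arbitrarily. You can name them however you like, as long as the references inside the configuration files match.

1. prometheus/prome/prometheus.yml is the Prometheus configuration file, used to configure components and scrape targets. A simple version follows; replace ip with your actual IP address.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['ip:29012']

rule_files:
  - "/etc/prometheus/rules/*.rules"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['ip:29011']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['ip:29015']
  # Added job: the host rules below use node_* metrics, which come from node-exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['ip:29014']

2. Define the alerting rules in Prometheus. When a rule's condition is met, Prometheus notifies Alertmanager, which sends the alert. The rule file name must match the rule_files pattern above (*.rules), otherwise the rules are never loaded.

File prometheus/prome/rules/prometheus.rules:

groups:
  - name: HostDown                # group name
    rules:
      - alert: HostDown           # alert name
        expr: up == 0             # expression evaluated against the metrics
        for: 60s                  # how long the condition must hold before firing
        labels:                   # custom alert labels
          severity: warning
        annotations:              # alert text, adjust as needed
          summary: "{{ $labels.instance }} has been down for more than 1 minute!"

  - name: HostHighMemoryUsage
    rules:
      - alert: HostHighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 80%, instance: {{ $labels.instance }}, current value: {{ $value }}%"

  - name: HostHighCpuUsage
    rules:
      - alert: HostHighCpuUsage
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 70
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage over the last minute above 70%, instance: {{ $labels.instance }}, current value: {{ $value }}%"

  - name: HostHighDiskUsage
    rules:
      - alert: HostHighDiskUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80%, instance: {{ $labels.instance }}, current value: {{ $value }}%"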
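To make the rule expressions above concrete, here is the same arithmetic as a minimal Python sketch. The function names and the sample numbers are made up for illustration; Prometheus evaluates the real expressions in PromQL against the node-exporter metrics, and the thresholds (80, 70, 80) are the ones used in the rules.

```python
def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """(1 - MemAvailable / MemTotal) * 100, as in the memory rule."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


def cpu_usage_percent(avg_idle_rate):
    """100 - idle_rate * 100, where idle_rate is the per-instance average
    of irate(node_cpu_seconds_total{mode="idle"}[1m]) (a value in 0..1)."""
    return 100 - avg_idle_rate * 100


def disk_usage_percent(free_bytes, size_bytes):
    """100 - free / size * 100, as in the disk rule."""
    return 100 - free_bytes / size_bytes * 100


GIB = 1024 ** 3

# 2 GiB available out of 16 GiB: 87.5% used, above the 80% threshold
mem = memory_usage_percent(2 * GIB, 16 * GIB)
# CPUs idle 25% of the time on average: 75% used, above the 70% threshold
cpu = cpu_usage_percent(0.25)
# 25 GiB free out of 100 GiB: 75% used, below the 80% threshold
disk = disk_usage_percent(25 * GIB, 100 * GIB)

print(mem > 80, cpu > 70, disk > 80)
```

A condition that evaluates to true must also stay true for the rule's `for` duration (1m here) before the alert actually fires.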
III. Alerting configuration files
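1. Alertmanager

Alertmanager is the alerting component that Prometheus relies on; every alert notification is sent through it. Alertmanager does not support DingTalk out of the box, so it forwards alerts to the dingtalk component through a webhook. The configuration below needs the URL of the dingtalk component; replace ip with your actual IP address.

File prometheus/alert/alertmanager.yml:

global:
  # an alert is treated as resolved if no update for it arrives within this time
  resolve_timeout: 5m

# route defines how alerts are dispatched
route:
  # labels used to group alerts
  group_by: ['alertname']
  # how long to wait after the first alert of a new group, so alerts of the same group are sent together
  group_wait: 30s
  # how long to wait before notifying about new alerts added to a group that was already notified
  group_interval: 30s
  # how long to wait before re-sending an alert that is still firing, to reduce repeated notifications
  repeat_interval: 1h
  # default receiver
  receiver: 'webhook'

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://ip:29016/dingtalk/webhook/send'
        send_resolved: true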
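To check the webhook path without waiting for a real alert, a test payload can be posted straight to the dingtalk component. The following Python sketch builds a payload in the shape of Alertmanager's webhook format (version 4) as I understand it; the helper names, the sample labels, and the `ip` placeholder in the URL are assumptions for illustration, not part of the original setup.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_payload(alertname="TestAlert", instance="ip:29014",
                       summary="test alert, please ignore"):
    """Build a firing alert in the shape of Alertmanager's webhook payload."""
    labels = {"alertname": alertname, "instance": instance, "severity": "warning"}
    annotations = {"summary": summary}
    return {
        "version": "4",
        "groupKey": "test:" + alertname,  # arbitrary identifier for this test
        "status": "firing",
        "receiver": "webhook",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "",
        "alerts": [{
            "status": "firing",
            "labels": labels,
            "annotations": annotations,
            "startsAt": datetime.now(timezone.utc).isoformat(),
            "endsAt": "0001-01-01T00:00:00Z",  # zero time: not resolved yet
            "generatorURL": "",
        }],
    }


def send_payload(url, payload, timeout=5):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With the stack running, something like send_payload("http://ip:29016/dingtalk/webhook/send", build_test_payload()) should make a message appear in the DingTalk group if the dingtalk component is configured correctly.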
2. dingtalk

First add a DingTalk alert robot:

- Create an alert group in DingTalk, open the group settings, and choose the robot option.
- Add a custom robot.
- Choose the signing option (加签) as the security setting. Once the robot is created, its webhook URL and signing secret are generated; copy and save them for later.

Back in the dingtalk component, configure the matching robot URL and secret. The templates entry has to be uncommented for the custom template mounted in docker-compose.yml to be loaded.

File prometheus/alert/config.yml:

## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
templates:
  - '/opt/dingtalk/template/dingtalk.tmpl'

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook:
    url: 'https://oapi.dingtalk.com/robot/send?access_token=<your access_token>'
    # secret for signature
    secret: '<your secret>'

Then create the alert message template. Two details matter here. The time format must use Go's reference time, 2006-01-02 15:04:05; any other date in the layout string produces wrong output. And because the rules in section II only set a summary annotation, the template reads .Annotations.summary. Adding 28800e9 nanoseconds (8 hours) shifts the UTC timestamps to UTC+8.

File prometheus/alert/dingtalk.tmpl:

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]{{ end }}

{{ define "__alert_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

Alert status: {{ .Status }}

Severity: {{ .Labels.severity }}

Alert name: {{ .Labels.alertname }}

Host: {{ .Labels.instance }}

Details: {{ .Annotations.summary }}

Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}

{{ define "__resolved_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

Alert status: {{ .Status }}

Severity: {{ .Labels.severity }}

Alert name: {{ .Labels.alertname }}

Host: {{ .Labels.instance }}

Details: {{ .Annotations.summary }}

Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}

Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}

{{ define "default.title" }}{{ template "__subject" . }}{{ end }}

{{ define "default.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**Prometheus alert firing**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
**Prometheus alert resolved**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}

{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
{{ template "default.title" . }}
{{ template "default.content" . }}
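For background on the signing option chosen when creating the robot: a signed request carries a millisecond timestamp and an HMAC-SHA256 signature derived from the secret. The dingtalk component is expected to do this itself once secret is set in config.yml, so the Python sketch below is only an illustration of the scheme as DingTalk documents it. The function names and the sample secret and token are hypothetical.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_sign(secret, timestamp_ms):
    """Sign "<timestamp>\\n<secret>" with HMAC-SHA256 keyed by the secret,
    then Base64-encode and URL-encode the digest."""
    string_to_sign = "{}\n{}".format(timestamp_ms, secret)
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append the timestamp and sign query parameters to the robot URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    sign = dingtalk_sign(secret, timestamp_ms)
    return "{}&timestamp={}&sign={}".format(webhook_url, timestamp_ms, sign)


# Placeholder values, not real credentials
example = signed_url(
    "https://oapi.dingtalk.com/robot/send?access_token=PLACEHOLDER",
    "SECplaceholder",
    timestamp_ms=1700000000000,
)
print(example)
```

If messages are rejected with a signature error, a clock that is far off on the host running the dingtalk container is a common thing to check, since the timestamp is part of the signature.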
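IV. Running docker-compose

Finally, start the stack by running docker-compose up -d in the prometheus directory. All containers start with their configuration mounted. Normally every container comes up; if one is stuck in the restarting state, something is wrong. Check the specific error with docker logs followed by the container name and fix it accordingly.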
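Spotting containers stuck in the restarting state can be automated with a few lines of Python. This is a minimal sketch: the helper names are made up, and it assumes the docker CLI is on the PATH and that `docker ps -a --format` prints the name and status separated by a tab, as requested by the format string.

```python
import subprocess


def restarting_containers(ps_output):
    """Return names of containers whose status starts with "Restarting".

    ps_output is the text printed by:
      docker ps -a --format "{{.Names}}\t{{.Status}}"
    """
    names = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        if status.strip().lower().startswith("restarting"):
            names.append(name.strip())
    return names


def docker_ps_output():
    """Run docker ps and return its output (requires the docker CLI)."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On the Docker host, restarting_containers(docker_ps_output()) lists the containers that need a closer look with docker logs.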