Once the Kubernetes cluster is deployed, the next task is collecting logs and monitoring the Pods running in it. Kubernetes automatically creates and destroys Pods across the N servers in the cluster, which makes it hard to track the runtime state and resource consumption of all Pods and servers in real time. It feels like racing a sports car down the highway with no dashboard, and that is unnerving. In past work I have used monitoring tools such as Nagios, Cacti, and Zabbix, but none of them fit a Kubernetes cluster well. So we introduce a new monitoring tool: Prometheus.
## Introduction to Prometheus

Prometheus is a monitoring system open-sourced by SoundCloud. Its design was inspired by Google's internal monitoring systems, which makes it a natural fit for Kubernetes, another project with Google roots. Prometheus integrates data collection, storage, and alerting into a single complete solution. For large cluster environments it introduced a pull-based collection model, a multi-dimensional data model, and service discovery.

Unlike traditional monitoring tools, Prometheus uses service discovery to learn about the monitoring endpoints already exposed inside the cluster, and then actively pulls all metrics from them. With this architecture we only need to deploy a single Prometheus instance into the Kubernetes cluster: it queries the apiserver for cluster state, then scrapes runtime data for all Pods from every kubelet that exposes Prometheus metrics. If we also want metrics from the underlying servers, we run the companion node-exporter on every server via a DaemonSet, and Prometheus picks up the new data automatically. This dynamic-discovery architecture suits a Kubernetes environment where neither servers nor workloads are fixed, and it greatly reduces the operational burden.
Prometheus website: https://prometheus.io/
Prometheus downloads: https://prometheus.io/download/
Prometheus documentation: https://prometheus.io/docs/introduction/overview/
## Environment

- Prometheus v2.2.0
- node-exporter v0.15.2
- Kubernetes v1.8.2
- CentOS 7.4
| Role | IP | Notes |
| --- | --- | --- |
| k8s master | 192.168.1.195 | k8s master |
| k8s node | 192.168.1.198 | k8s node, Prometheus, node-exporter |
| k8s node | 192.168.1.199 | k8s node, Prometheus, node-exporter |
## Deploying node-exporter

node-exporter collects metrics from the underlying servers. The official description:

> Prometheus exporter for hardware and OS metrics exposed by *NIX kernels, written in Go with pluggable metric collectors.
> The WMI exporter is recommended for Windows users.
node-exporter on GitHub: https://github.com/prometheus/node_exporter

To collect metrics from every node, we deploy the Pods as a DaemonSet.
node-exporter.yaml:

```yaml
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-ops
  labels:
    k8s-app: node-exporter
spec:
  template:
    metadata:
      labels:
        k8s-app: node-exporter
    spec:
      containers:
      - image: prom/node-exporter:latest
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
          name: http
        volumeMounts:
        - name: time
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: time
        hostPath:
          path: /etc/localtime
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: kube-ops
spec:
  ports:
  - name: http
    port: 9100
    targetPort: 9100
    protocol: TCP
  selector:
    k8s-app: node-exporter
```
```bash
[root@localhost prometheus]# kubectl create -f node-exporter.yaml
[root@localhost prometheus]#
```
```bash
[root@localhost prometheus]# kubectl get pods -n kube-ops -o wide
NAME                  READY     STATUS    RESTARTS   AGE       IP            NODE
node-exporter-8d66t   1/1       Running   0          1h        172.30.41.5   192.168.1.199
node-exporter-xn5ss   1/1       Running   0          1h        172.30.57.6   192.168.1.198
[root@localhost prometheus]#
```
```bash
[root@localhost prometheus]# kubectl get svc -n kube-ops -o wide
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE       SELECTOR
node-exporter   ClusterIP   172.16.152.14   <none>        9100/TCP   1h        k8s-app=node-exporter
[root@localhost prometheus]#
```
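Because the DaemonSet also binds hostPort 9100, you can verify that node-exporter is serving metrics by curling any node directly (node IP taken from the table above; the exact metric lines will vary):

```bash
# Fetch a few raw metric lines straight from node-exporter on one node
curl -s http://192.168.1.198:9100/metrics | head -n 20
```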
## Deploying the Service Account

Kubernetes enabled the RBAC feature starting with 1.8.0, so we first need to grant permissions through RBAC; Prometheus then uses this ServiceAccount to connect to the Kubernetes cluster. Without it, the connection to the API server is rejected. Reference: https://kubernetes.io/docs/admin/authorization/rbac/
prometheus-service-account.yml:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-ops
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: kube-ops
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  namespace: kube-ops
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-ops
```
```bash
[root@localhost prometheus]# kubectl create -f prometheus-service-account.yml
```
```bash
[root@localhost prometheus]# kubectl get serviceaccount -n kube-ops
NAME         SECRETS   AGE
default      1         1d
prometheus   1         55m
[root@localhost prometheus]#
```
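As an optional sanity check, kubectl can ask the API server whether the new ServiceAccount actually holds the permissions granted by the ClusterRole:

```bash
# Both commands should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list pods --as=system:serviceaccount:kube-ops:prometheus
kubectl auth can-i watch nodes --as=system:serviceaccount:kube-ops:prometheus
```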
## Deploying the Prometheus Alertmanager configuration

We use a ConfigMap to hold the Alertmanager configuration. Reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
prometheus-alertmanager-config.yml:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-ops
data:
  config.yml: |-
    global:
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'user1@example.com'
      smtp_auth_username: 'user1@example.com'
      smtp_auth_password: 'password'
      smtp_require_tls: false
      resolve_timeout: 5m
    templates:
    - '/etc/alertmanager/*.tmpl'
    route:
      receiver: email
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10d
      group_by: [alertname]
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: email
      email_configs:
      - send_resolved: true
        to: 'user2@example.com,user3@example.com'
```
- `repeat_interval`: how long to wait before re-sending a notification for an alert that is still firing
- `to: 'user2@example.com,user3@example.com'`: multiple recipients, with each address separated by a comma
```bash
[root@localhost prometheus]# kubectl create -f prometheus-alertmanager-config.yml
```
```bash
[root@localhost prometheus]# kubectl get configmap -n kube-ops
NAME           DATA      AGE
alertmanager   2         21h
[root@localhost prometheus]#
```
Be sure to replace the `global.smtp_*` settings and the `email_configs` under `receivers.name` with your own mail account details.
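Before loading the ConfigMap, the Alertmanager configuration can be linted with amtool, which ships with the Alertmanager release tarball (a sketch; it assumes the config.yml body has been saved to a local file):

```bash
# Validate routes, receivers, and SMTP settings syntactically
amtool check-config config.yml
```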
## Deploying the Prometheus configuration

prometheus-config.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 30s
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["192.168.1.198:9093"]
    rule_files:
    - "rules.yml"
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-nodes'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: 192.168.1.195:6443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: 192.168.1.195:6443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-node-exporter'
      scheme: http
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
  rules.yml: |
    groups:
    - name: rule
      rules:
      - alert: NodeFilesystemUsage
        expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{ $labels.instance }}: High Filesystem usage detected"
          description: "{{ $labels.instance }}: Filesystem usage is above 80% (current value is: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{ $labels.instance }}: High Memory usage detected"
          description: "{{ $labels.instance }}: Memory usage is above 80% (current value is: {{ $value }})"
      - alert: NodeCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{ $labels.instance }}: High CPU usage detected"
          description: "{{ $labels.instance }}: CPU usage is above 80% (current value is: {{ $value }})"
```
Notes on this configuration:

- In the `job_name: 'kubernetes-node-exporter'` job, the relabel rule rewrites the scrape address to the port node-exporter exposes (replacing 31672, the NodePort used in the referenced example, with 9100). Fill this in according to your actual setup: node-exporter.yaml above specifies `targetPort: 9100`, so the port here must be 9100.
- `kubernetes.default.svc:443` is the Kubernetes API address; if your cluster was not installed with the default DNS, change it manually.
- The new `alerting` section points Prometheus at Alertmanager, so update the IP in `alerting.alertmanagers.static_configs.targets`. Here Prometheus and Alertmanager run as two Docker containers, and the IP is the address of the host running Alertmanager.
- `rule_files` registers the alerting rules, and rules.yml adds three of them: node filesystem usage, node memory usage, and node CPU usage. When a value exceeds 80%, the alert fires with the label `team: node`, which the route in the Alertmanager configuration matches, so these alerts go to the `email` receiver.
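Before creating the ConfigMap, both files can be validated locally with promtool, which is bundled with the Prometheus 2.x release (a sketch; it assumes prometheus.yml and rules.yml are saved as local files):

```bash
# Checks prometheus.yml and the rule files it references
promtool check config prometheus.yml
# Or check the alerting rules on their own
promtool check rules rules.yml
```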
```bash
[root@localhost prometheus]# kubectl create -f prometheus-config.yaml
```
```bash
[root@localhost prometheus]# kubectl get configmap -n kube-ops
NAME                DATA      AGE
alertmanager        2         21h
prometheus-config   1         21h
[root@localhost prometheus]#
```
## Deploying Prometheus

Prometheus itself is deployed as a Deployment. Reference: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
Create the node label that the Deployment's `nodeSelector` (`appNodes: pro-00-monitor`) will match:

```bash
[root@localhost ~]# kubectl label node 192.168.1.198 appNodes=pro-00-monitor
node "192.168.1.198" labeled
[root@localhost ~]# kubectl get nodes
NAME            STATUS    ROLES     AGE       VERSION
192.168.1.198   Ready     <none>    24m       v1.9.2
[root@localhost ~]#
```
prometheus-deploy.yaml:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus
  namespace: kube-ops
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      nodeSelector:
        appNodes: pro-00-monitor
      securityContext:
        runAsUser: 0
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:v2.2.0
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=15d"
        ports:
        - containerPort: 9090
          hostPort: 9090
          protocol: TCP
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
          subPath: prometheus/data
        - mountPath: "/etc/prometheus"
          name: config-volume
        - mountPath: "/etc/localtime"
          name: time
          readOnly: true
        resources:
          requests:
            cpu: 1
            memory: 1Gi
          limits:
            cpu: 1
            memory: 2Gi
      - image: prom/alertmanager:v0.14.0
        name: alertmanager
        args:
        - "--config.file=/etc/alertmanager/config.yml"
        - "--storage.path=/alertmanager"
        ports:
        - containerPort: 9093
          hostPort: 9093
          protocol: TCP
          name: http
        volumeMounts:
        - name: alertmanager-config-volume
          mountPath: /etc/alertmanager
        resources:
          requests:
            memory: 500Mi
          limits:
            memory: 1024Mi
      volumes:
      - name: data
        hostPath:
          path: "/data/monitor"
      - name: time
        hostPath:
          path: "/etc/localtime"
      - configMap:
          name: prometheus-config
        name: config-volume
      - name: alertmanager-config-volume
        configMap:
          name: alertmanager
```
```bash
[root@localhost prometheus]# kubectl create -f prometheus-deploy.yaml
```
```bash
[root@localhost prometheus]# kubectl get pods -n kube-ops -o wide | grep prometheus
prometheus-fc7685cc7-rwlc7   1/1       Running   0          34s       172.30.57.7   192.168.1.198
[root@localhost prometheus]#
```
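If the Pod does not reach Running, the container logs usually point at the cause (the Pod name below comes from the output above; substitute your own):

```bash
# Tail the Prometheus container inside the Pod
kubectl logs -f prometheus-fc7685cc7-rwlc7 -c prometheus -n kube-ops
# And the Alertmanager container alongside it
kubectl logs prometheus-fc7685cc7-rwlc7 -c alertmanager -n kube-ops
```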
```bash
[root@localhost ~]# netstat -lntp | grep -E '9090|9093'
tcp6       0      0 :::9090      :::*       LISTEN      4023/docker-proxy
tcp6       0      0 :::9093      :::*       LISTEN      3983/docker-proxy
[root@localhost ~]#
```
## Accessing Prometheus

Once Prometheus has started, open the Prometheus dashboard at http://ip:9090/graph. Under Status -> Targets you can see that Prometheus has successfully reached the Kubernetes API server and is collecting metrics.
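The target list is also available from the HTTP API, which is convenient for scripting health checks (the IP is assumed from the deployment above):

```bash
# List all discovered scrape targets and their up/down state
curl -s http://192.168.1.198:9090/api/v1/targets
```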
## Accessing Alertmanager

Once Alertmanager has started, open the Alertmanager dashboard at http://ip:9093. The Alertmanager endpoint is also visible in the Prometheus dashboard, at the bottom of Status -> Runtime & Build Information.
## Alerting rules

Once the rules defined in Prometheus take effect, they appear under Status -> Rules. Clicking an `expr` jumps straight to the Prometheus Graph page with that query, so when writing alerting rules you can first test the expression in Prometheus. The Alerts page in Prometheus shows the state of each triggered alerting rule.
Currently three hosts have triggered the rules. An alert passes through three states during its lifecycle:

- `inactive`: the alert is in neither the `firing` nor the `pending` state
- `pending`: the alert has been activated but has not yet exceeded the configured threshold duration. Prometheus waits (roughly three minutes here), batching all alert entries before sending them together to Alertmanager
- `firing`: the alert has been active beyond the configured threshold duration. Prometheus is now sending it to Alertmanager; this is the final state
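These states can also be inspected via the built-in `ALERTS` series that Prometheus maintains for every active alert (IP assumed from the deployment above):

```bash
# List alerts currently in the firing state via the query API
curl -sG 'http://192.168.1.198:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertstate="firing"}'
```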
After a rule fires, the alert also shows up on the Alertmanager dashboard. And finally, a screenshot of an alert email successfully delivered by Alertmanager.
## Querying monitoring data

Prometheus exposes an HTTP API for querying data, and the same query language can be used for complex query tasks. Click Graph and enter the following to chart per-Pod CPU usage:

```
sum by (pod_name)( rate(container_cpu_usage_seconds_total{image!="", pod_name!=""}[1m] ) )
```
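A couple of queries in the same spirit (sketches; these cAdvisor metric names may vary by Kubernetes version):

```bash
# Per-Pod memory usage, to enter in the Graph page:
#   sum by (pod_name)( container_memory_usage_bytes{image!="", pod_name!=""} )
# Or issue the CPU query above via the HTTP API:
curl -sG 'http://192.168.1.198:9090/api/v1/query' \
  --data-urlencode 'query=sum by (pod_name)( rate(container_cpu_usage_seconds_total{image!="", pod_name!=""}[1m]) )'
```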
More on querying:
https://prometheus.io/docs/prometheus/latest/querying/basics/
https://prometheus.io/docs/prometheus/latest/querying/api/
https://prometheus.io/docs/prometheus/latest/querying/examples/
## Q&A

Question: Prometheus fails to start with a permission-denied error when opening its storage directory.

Answer: add the following under `spec.spec.` in prometheus-deploy.yaml:

```yaml
securityContext:
  runAsUser: 0
```

See the full prometheus-deploy.yaml above for the complete configuration. Reference: https://github.com/prometheus/prometheus/issues/2939
Question:

```
level=error ts=2018-04-23T13:08:34.417214948Z caller=notify.go:303 component=dispatcher msg="Error on notify" err="dial tcp 14.18.245.164:25: getsockopt: connection timed out"
level=error ts=2018-04-23T13:08:34.417316796Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="dial tcp 14.18.245.164:25: getsockopt: connection timed out"
```
Answer: with the error above, first check whether the Docker container has network access, then test connectivity to the SMTP server. In my testing, Tencent mail (both personal QQ mail and the enterprise mail) only supports SMTP over SSL on port 465 and does not support TLS. The `smtp_require_tls` parameter defaults to `true`, so it must be set to `smtp_require_tls: false` here. Other providers' SMTP setups need to be tested individually.
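A quick way to test SMTP reachability from the host running the containers (a sketch using openssl, which speaks SSL on port 465 directly):

```bash
# A banner from the server means TCP and SSL connectivity are fine
openssl s_client -connect smtp.exmail.qq.com:465 -quiet
```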
## Monitoring an HTTP-only Kubernetes cluster with Prometheus

Environment:

- Prometheus v1.0.1
- node-exporter v0.15.2
- Kubernetes v1.8.2

The server roles are the same as above.
```bash
[root@localhost ~]# netstat -lntp | grep kube-apiserver
tcp        0      0 192.168.1.195:6443      0.0.0.0:*       LISTEN      4555/kube-apiserver
tcp        0      0 192.168.1.195:8080      0.0.0.0:*       LISTEN      4555/kube-apiserver
[root@localhost ~]#
```
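The essential difference from the TLS setup is that service discovery talks to the apiserver's insecure HTTP port (8080 above) instead of using the in-cluster certificate and token. A minimal sketch of that idea, written in the Prometheus 2.x syntax for illustration (v1.0.1, which this section actually uses, has a different configuration format):

```yaml
# Sketch: discover nodes via the insecure apiserver endpoint, no TLS or bearer token
scrape_configs:
  - job_name: 'kubernetes-nodes-http'
    kubernetes_sd_configs:
      - role: node
        api_server: http://192.168.1.195:8080
```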
```bash
[root@localhost prometheus]# kubectl create -f node-exporter.yaml
[root@localhost prometheus]# kubectl create -f prometheus-config.yaml
[root@localhost prometheus]# kubectl create -f prometheus-deploy.yaml
```
```bash
[root@localhost prometheus]# kubectl get pods -n kube-ops -o wide
NAME                          READY     STATUS    RESTARTS   AGE       IP            NODE
node-exporter-hjgds           1/1       Running   0          23m       172.30.41.5   192.168.1.199
node-exporter-zmlcg           1/1       Running   0          23m       172.30.57.6   192.168.1.198
prometheus-5f86bc8bc5-tbnxp   1/1       Running   0          10m       172.30.41.6   192.168.1.199
[root@localhost prometheus]#
```
```bash
[root@localhost prometheus]# kubectl get svc -n kube-ops -o wide
NAME            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE       SELECTOR
node-exporter   ClusterIP   172.16.40.83   <none>        9100/TCP   24m       k8s-app=node-exporter
[root@localhost prometheus]#
```
```bash
[root@localhost prometheus]# kubectl get configmap -n kube-ops
NAME                DATA      AGE
prometheus-config   1         22m
[root@localhost prometheus]#
```
Everything else is the same as above; the configuration files for monitoring an HTTP-only Kubernetes cluster are in the attachment. Note: in my testing, only Prometheus v1.0.1 worked here; every other version I tried produced errors.
References:
https://blog.qikqiak.com/post/kubernetes-monitor-prometheus-grafana/
https://blog.qikqiak.com/post/update-prometheus-2-in-kubernetes/
https://github.com/cnych/k8s-repo/tree/master/prometheus
https://blog.qikqiak.com/post/alertmanager-of-prometheus-in-practice/
https://blog.csdn.net/qq_21398167/article/details/76008594?locationnum=10&fps=1
https://segmentfault.com/a/1190000008695463
https://prometheus.io/docs/alerting/overview/
Attachments: Prometheus监控TLS K8S配置文件.zip, Prometheus监控K8S HTTP配置文件.zip
This article originally appeared on the Jack Wang Blog: http://www.yfshare.vip/2018/03/14/Prometheus监控TLS-Kubernetes集群/