Monitoring system — PoC
Introduction
This article presents a proof of concept of the monitoring solution suggested in the article Hathor full node monitoring. It aims to serve as an example for developers implementing a Prometheus-based monitoring system for their full nodes.
This article is intended to be used in conjunction with two other articles, namely:
- Hathor full node monitoring: the design of the monitoring solution that this proof of concept implements.
- How to Install a Monitoring System for Hathor Full Node: step-by-step instructions to deploy this solution.
Overview
This proof of concept comprises the following components:
- Hathor full node
- Node exporter
- Prometheus
- Grafana
- Alertmanager
- Docker Compose
Docker Compose is used to orchestrate containers for the other five components. Thus, this proof of concept consists of a set of configuration files, organized in the following directory structure:
poc/
├── prometheus/
│ ├── prometheus.yml
│ └── alerting_rules.yml
├── grafana/
│ ├── dashboards/
│ │ ├── hathor-core/
│ │ │ └── hathor_fullnodes.json
│ │ └── dashboards.yml
│ └── datasources/
│ └── prometheus.yml
├── alertmanager/
│ ├── config/
│ │ └── template_sns.tmpl
│ └── alertmanager.yml
└── docker-compose.yml
The following sections discuss the configuration files for Docker Compose, Prometheus, Grafana, and Alertmanager. Hathor full node and Node exporter do not require separate configuration files, as all their execution parameters are already set within Docker Compose. To download the source code of this proof of concept (that is, the configuration files presented throughout this article), use hathor-monitoring-system-poc.zip.
<Placeholders>: in the code samples of this article, as in all Hathor docs, <placeholders> are always wrapped by angle brackets < >. You shall interpret or replace a <placeholder> with a value according to the context. Whenever replacing a <placeholder> like this one with a value, do not wrap the value with quotes. Quotes, when necessary, will be indicated, wrapping the "<placeholder>" like this one.
Note that many configurations depend on the deployment environment. In this proof of concept, the case of AWS is considered.
Docker Compose
The docker-compose.yml file is located in the root of the poc directory. This file is the only configuration file required to orchestrate containers for all five components of the proof of concept. For example:
poc/docker-compose.yml
services:
  hathor-core:
    image: hathornetwork/hathor-core
    command: run_node
    ports:
      - "8080:8080"
    volumes:
      - <absolute_path_hathor_full_node>/data:/data
    environment:
      - HATHOR_TESTNET=true
      - HATHOR_STATUS=8080
      - HATHOR_WALLET_INDEX=true
      - HATHOR_CACHE=true
      - HATHOR_CACHE_SIZE=100000
      - HATHOR_DATA=/data
      - HATHOR_PROMETHEUS=true
      - "HATHOR_PROMETHEUS_PREFIX=hathor_core:"

  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
      - '--collector.textfile.directory=/host/data/prometheus'
    ports:
      - "9100:9100"
    pid: host
    restart: unless-stopped
    volumes:
      - <absolute_path_hathor_full_node>:/host:ro,rslave
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus:/etc/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    environment:
      - AWS_ACCESS_KEY_ID=<string>
      - AWS_SECRET_ACCESS_KEY=<string>
    ports:
      - '9090:9090'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=<admin_password>
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - '9093:9093'
    networks:
      - monitoring

networks:
  monitoring:
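With docker-compose.yml in place, the whole stack can be started from the poc directory and each component checked on its mapped port. The following commands are a minimal sketch, assuming Docker with the Compose plugin is installed and that each component exposes its default API paths:

docker compose up -d

curl http://localhost:8080/v1a/status   # Hathor full node status API
curl http://localhost:9100/metrics      # Node exporter metrics
curl http://localhost:9090/-/ready      # Prometheus readiness probe
curl http://localhost:3000/api/health   # Grafana health check
curl http://localhost:9093/-/ready      # Alertmanager readiness probe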
Prometheus
The prometheus directory contains the two configuration files required for running Prometheus:
- prometheus.yml
- alerting_rules.yml

prometheus.yml specifies the overall configuration for the execution of Prometheus. For example:
poc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/alerting_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'local-nodes'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          # These labels are used by Grafana
          network: testnet
          instance_name: hathor-full-node-testnet-local
  - job_name: 'aws-nodes'
    ec2_sd_configs:
      - region: <string>
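Before (re)starting the stack with an edited configuration, the file can be validated with promtool, the checking tool shipped in the prom/prometheus image. A minimal sketch, assuming it is run from the poc directory:

docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus:/etc/prometheus" \
  prom/prometheus check config /etc/prometheus/prometheus.yml

Besides the main configuration, this command also validates the rule files referenced under rule_files.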
alerting_rules.yml specifies the alarms and alerts to be created in Prometheus. For example:
poc/prometheus/alerting_rules.yml
groups:
  - name: hathor-full-nodes-blocks.rules
    rules:
      - alert: FullNodeBlocksWarning
        expr: increase(hathor_core:blocks{job='aws-nodes'}[5m]) < 1
        for: 20m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: Fullnode blocks not syncing - {{ $labels.instance_name }}
          description: "The Fullnode has not received any blocks for 25 minutes \n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-disk.rules
    rules:
      - alert: FullNodeUsedDiskSpaceWarning
        expr: ((node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'} - node_filesystem_avail_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'}) / (node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'})) * 100 > 85
        for: 10m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: FullNode used disk space - {{ $labels.instance_name }}
          description: "More than 85% of the disk space has been used\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-cpu.rules
    rules:
      - alert: FullNodeCpuUsageWarning
        # The offset is used to ignore the first 6h of metrics in recently created full-nodes, since their initial syncing process could use a lot of CPU
        expr: 1 - rate(node_cpu_seconds_total{mode='idle',job='aws-nodes'}[5m]) > 0.85 and ON(instance) (node_cpu_seconds_total{mode='idle',job='aws-nodes'} offset 6h) > 0
        for: 15m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: FullNode high cpu usage - {{ $labels.instance_name }}
          description: "The cpu usage is higher than 85%\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-memory.rules
    rules:
      - alert: FullNodeMemoryUsageMajor
        expr: ((node_memory_MemTotal_bytes{job='aws-nodes'} - node_memory_MemFree_bytes{job='aws-nodes'} - node_memory_Cached_bytes{job='aws-nodes'} - node_memory_Buffers_bytes{job='aws-nodes'} - node_memory_Slab_bytes{job='aws-nodes'}) / (node_memory_MemTotal_bytes{job='aws-nodes'} )) * 100 > 95
        for: 5m
        labels:
          application: hathor-core
          severity: major
        annotations:
          summary: FullNode memory usage too high - {{ $labels.instance_name }}
          description: "The memory usage is higher than 95%\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
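The alerting rules can also be checked in isolation with promtool. A minimal sketch, again run from the poc directory:

docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus:/etc/prometheus" \
  prom/prometheus check rules /etc/prometheus/alerting_rules.yml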
Grafana
The grafana directory contains the three configuration files required for running Grafana:
- datasources/prometheus.yml
- dashboards/dashboards.yml
- dashboards/hathor-core/hathor_fullnodes.json

datasources/prometheus.yml specifies how Grafana connects to Prometheus as a data source. For example:
poc/grafana/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
Note that the prometheus hostname in the http://prometheus:9090 URL can only be used with Docker Compose. Otherwise, one needs to use the network address of the Prometheus server.
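Once Grafana is up, one way to confirm that the data source was provisioned is to query the Grafana HTTP API, authenticating as the admin user with the <admin_password> set in docker-compose.yml. A minimal sketch:

curl -s -u admin:<admin_password> http://localhost:3000/api/datasources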
dashboards/dashboards.yml specifies the dashboard configuration in Grafana. For example:
poc/grafana/dashboards/dashboards.yml
apiVersion: 1

providers:
  # A unique provider name: <string> (required)
  - name: '<string>'
    # Org id: <int> (default to 1)
    orgId: 1
    # Name of the dashboard folder: <string>
    folder: ''
    # Folder UID: <string> (automatically generated if not specified)
    folderUid: ''
    # Provider type: <string> (default to 'file')
    type: file
    # Disable dashboard deletion: <bool>
    disableDeletion: false
    # How often Grafana scans for dashboard updates: <int>
    updateIntervalSeconds: 30
    # Allow updating provisioned dashboards from the UI: <bool>
    allowUiUpdates: false
    options:
      # Path to dashboard files on disk: <string> (required when using 'file')
      path: /etc/grafana/provisioning/dashboards
      # Use folder names from filesystem to create folders in Grafana: <bool>
      foldersFromFilesStructure: true
dashboards/hathor-core/hathor_fullnodes.json is the source code for the dashboard created by Hathor Labs to facilitate day-to-day operation of full nodes. To get this dashboard, refer to Hathor full node public dashboard.
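Similarly, after Grafana starts, the search endpoint of its HTTP API can be used to confirm that the provisioned dashboard was loaded. A minimal sketch:

curl -s -u admin:<admin_password> http://localhost:3000/api/search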
Alertmanager
The alertmanager directory contains the two configuration files for running Alertmanager:
- alertmanager.yml
- config/template_sns.tmpl

alertmanager.yml is required and specifies the overall configuration for the execution of Alertmanager. For example:
poc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: hathor-alert-manager-sns
  group_by: ['alertname', 'application', 'severity', 'environment']
  group_wait: 5s
  group_interval: 5m
  repeat_interval: 30m

receivers:
  - name: hathor-alert-manager-sns
    sns_configs:
      - api_url: https://sns.us-east-1.amazonaws.com
        sigv4:
          region: us-east-1
          access_key: <string>
          secret_key: <string>
        topic_arn: arn:aws:sns:us-east-1:1234567890:your-sns-topic-name
        subject: '{{ template "sns.hathor.subject" . }}'
        message: '{{ template "sns.hathor.text" . }}'
        attributes:
          application: '{{ or .CommonLabels.application "-" }}'
          chart: '{{ or .CommonAnnotations.link "-" }}'
          runbook: '{{ or .CommonAnnotations.runbook "-" }}'
          severity: '{{ or .CommonLabels.severity "-" }}'
          source: prometheus
          status: '{{ or .Status "-" }}'
          title: '{{ or .CommonLabels.alertname "-" }}'

templates:
  - /etc/alertmanager/config/*.tmpl
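The Alertmanager configuration, including the templates it references, can be validated with amtool, which ships in the prom/alertmanager image. A minimal sketch, assuming it is run from the poc directory:

docker run --rm --entrypoint amtool \
  -v "$(pwd)/alertmanager:/etc/alertmanager" \
  prom/alertmanager check-config /etc/alertmanager/alertmanager.yml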
Note that beyond editing this configuration file, it is also necessary to set up each notification receiver. In this proof of concept, the only receiver is AWS SNS.
config/template_sns.tmpl is optional and specifies a template for the notification messages dispatched by Alertmanager to its defined receivers. For example:
poc/alertmanager/config/template_sns.tmpl
# template_sns.tmpl
{{ define "sns.hathor.subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}{{ end }}
{{ define "sns.hathor.text" }}
{{- $root := . -}}
{{ template "sns.hathor.subject" . }}
{{ range .Alerts }}
*Severity:* `{{ .Labels.severity }}`
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Chart:* {{ .Annotations.link }}
*Runbook:* {{ .Annotations.runbook }}
*Details:*
{{ range .Labels.SortedPairs }} - *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
{{ end }}
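To exercise the notification pipeline end to end without waiting for a real alert, a synthetic alert can be posted directly to the Alertmanager API. A minimal sketch, using hypothetical label values chosen to match the labels and attributes above:

curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","application":"hathor-core","severity":"warning"},"annotations":{"summary":"Test notification","description":"Manual test of the SNS receiver"}}]'

If the SNS receiver is correctly configured, a message rendered with the templates above should reach the subscribed endpoint shortly after group_wait elapses.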
What's next?
- Hathor full node monitoring: to know how to implement a monitoring solution for your full node(s).
- Hathor full node metrics: reference material regarding the metrics tracked by the full node.
- How to Install a Monitoring System for Hathor Full Node: a step-by-step guide to install a Prometheus-based monitoring system.
- Hathor full node pathway: to know how to operate a full node.