Monitoring system — PoC

Introduction

This article presents a proof of concept of the monitoring solution suggested in the article Hathor full node monitoring. It aims to serve as an example for developers implementing a Prometheus-based monitoring system for their full nodes.

info

This article is intended to be used in conjunction with two other articles, namely:

  • Hathor full node monitoring
  • Hathor full node public dashboard

Overview

This proof of concept comprises the following components:

  • Hathor full node
  • Node exporter
  • Prometheus
  • Grafana
  • Alertmanager
  • Docker compose

Docker compose is used to orchestrate containers for the other five components. Thus, this proof of concept consists of a set of configuration files organized in the following directory structure:

poc/
├── prometheus/
│   ├── prometheus.yml
│   └── alerting_rules.yml
├── grafana/
│   ├── dashboards/
│   │   ├── hathor-core/
│   │   │   └── hathor_fullnodes.json
│   │   └── dashboards.yml
│   └── datasources/
│       └── prometheus.yml
├── alertmanager/
│   ├── config/
│   │   └── template_sns.tmpl
│   └── alertmanager.yml
└── docker-compose.yml

The following sections discuss the configuration files for Docker compose, Prometheus, Grafana, and Alertmanager. Hathor full node and Node exporter do not require separate configuration files, since all their execution parameters are set within Docker compose. To download the source code of this proof of concept (that is, the configuration files presented throughout this article), use hathor-monitoring-system-poc.zip.

tip

<Placeholders>: in the code samples of this article, as in all Hathor docs, <placeholders> are always wrapped by angle brackets < >. You shall interpret or replace a <placeholder> with a value according to the context. Whenever replacing a <placeholder> like this one with a value, do not wrap the value with quotes. Quotes, when necessary, will be indicated, wrapping the "<placeholder>" like this one.

Note that many configurations depend on the deployment environment. In this proof of concept, the case of AWS is considered.

Docker compose

The docker-compose.yml file is located in the root of the poc directory. It is the only configuration file required to orchestrate containers for the other five components of the proof of concept. For example:

poc/docker-compose.yml
services:
  hathor-core:
    image: hathornetwork/hathor-core
    command: run_node
    ports:
      - "8080:8080"
    volumes:
      - <absolute_path_hathor_full_node>/data:/data
    environment:
      - HATHOR_TESTNET=true
      - HATHOR_STATUS=8080
      - HATHOR_WALLET_INDEX=true
      - HATHOR_CACHE=true
      - HATHOR_CACHE_SIZE=100000
      - HATHOR_DATA=/data
      # Write Prometheus metrics into <data>/prometheus, where Node exporter's
      # textfile collector picks them up
      - HATHOR_PROMETHEUS=true
      - HATHOR_PROMETHEUS_PREFIX='hathor_core:'

  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
      - '--collector.textfile.directory=/host/data/prometheus'
    ports:
      - "9100:9100"
    pid: host
    restart: unless-stopped
    volumes:
      - <absolute_path_hathor_full_node>:/host:ro,rslave
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus:/etc/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    environment:
      # Credentials used by Prometheus EC2 service discovery (ec2_sd_configs)
      - AWS_ACCESS_KEY_ID=<string>
      - AWS_SECRET_ACCESS_KEY=<string>
    ports:
      - '9090:9090'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=<admin_password>
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - '9093:9093'
    networks:
      - monitoring

networks:
  monitoring:
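
As written, Prometheus and Grafana keep their data inside their containers, so metric history and dashboard changes made in the UI are lost whenever the containers are recreated. If persistence is desired, one option is to mount named volumes on the default data directories of each image. The sketch below is not part of the PoC files; the volume names are arbitrary, and it assumes the images' standard data paths (/prometheus for prom/prometheus and /var/lib/grafana for grafana/grafana):

  prometheus:
    # ...same configuration as above, plus an extra volume...
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus

  grafana:
    # ...same configuration as above, plus an extra volume...
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
      - grafana_data:/var/lib/grafana

volumes:
  prometheus_data:
  grafana_data: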

Prometheus

The prometheus directory contains the two configuration files required for running Prometheus:

  • prometheus.yml
  • alerting_rules.yml

prometheus.yml specifies the overall configuration for the execution of Prometheus. For example:

poc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/alerting_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'local-nodes'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          # These labels are used by Grafana
          network: testnet
          instance_name: hathor-full-node-testnet-local
  - job_name: 'aws-nodes'
    ec2_sd_configs:
      - region: <string>
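
In the aws-nodes job, Prometheus discovers targets through EC2 service discovery but, unlike the local-nodes job, does not attach the network and instance_name labels used by Grafana, nor does it point scrapes at the Node exporter port. One way to do both is to relabel the EC2 metadata. The following is a minimal sketch, not part of the PoC files; the Name and Network tag keys and the 9100 port are assumptions about how the instances are set up:

  - job_name: 'aws-nodes'
    ec2_sd_configs:
      - region: <string>
    relabel_configs:
      # Scrape Node exporter on port 9100 of each discovered instance
      - source_labels: [__meta_ec2_private_ip]
        regex: '(.*)'
        replacement: '${1}:9100'
        target_label: __address__
      # Copy EC2 tags into the labels used by Grafana
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_tag_Network]
        target_label: network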

alerting_rules.yml specifies the alarms and alerts to be created in Prometheus. For example:

poc/prometheus/alerting_rules.yml
groups:
  - name: hathor-full-nodes-blocks.rules
    rules:
      - alert: FullNodeBlocksWarning
        expr: increase(hathor_core:blocks{job='aws-nodes'}[5m]) < 1
        for: 20m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: Fullnode blocks not syncing - {{ $labels.instance_name }}
          description: "The Fullnode has not received any blocks for 25 minutes \n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-disk.rules
    rules:
      - alert: FullNodeUsedDiskSpaceWarning
        expr: ((node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'} - node_filesystem_avail_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'}) / (node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'})) * 100 > 85
        for: 10m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: FullNode used disk space - {{ $labels.instance_name }}
          description: "More than 85% of the disk space has been used\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-cpu.rules
    rules:
      - alert: FullNodeCpuUsageWarning
        # The offset is used to ignore the first 6h of metrics in recently created full-nodes, since their initial syncing process could use a lot of CPU
        expr: 1 - rate(node_cpu_seconds_total{mode='idle',job='aws-nodes'}[5m]) > 0.85 and ON(instance) (node_cpu_seconds_total{mode='idle',job='aws-nodes'} offset 6h) > 0
        for: 15m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: FullNode high cpu usage - {{ $labels.instance_name }}
          description: "The cpu usage is higher than 85%\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-memory.rules
    rules:
      - alert: FullNodeMemoryUsageMajor
        expr: ((node_memory_MemTotal_bytes{job='aws-nodes'} - node_memory_MemFree_bytes{job='aws-nodes'} - node_memory_Cached_bytes{job='aws-nodes'} - node_memory_Buffers_bytes{job='aws-nodes'} - node_memory_Slab_bytes{job='aws-nodes'}) / (node_memory_MemTotal_bytes{job='aws-nodes'})) * 100 > 95
        for: 5m
        labels:
          application: hathor-core
          severity: major
        annotations:
          summary: FullNode memory usage too high - {{ $labels.instance_name }}
          description: "The memory usage is higher than 95%\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
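
The Grafana links in the annotations above query the hathor_core:connected_peers metric, which can also drive an alert of its own using the same pattern. The following is a minimal sketch and not part of the PoC files; the rule name, threshold, and durations are illustrative:

  - name: hathor-full-nodes-peers.rules
    rules:
      - alert: FullNodeNoConnectedPeersWarning
        expr: hathor_core:connected_peers{job='aws-nodes'} == 0
        for: 10m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: Fullnode has no connected peers - {{ $labels.instance_name }}
          description: "The Fullnode has had no connected peers for 10 minutes\n VALUE = {{ $value }}\n"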

Grafana

The grafana directory contains the three configuration files required for running Grafana:

  • datasources/prometheus.yml
  • dashboards/dashboards.yml
  • dashboards/hathor-core/hathor_fullnodes.json

datasources/prometheus.yml specifies how Grafana connects to Prometheus as a data source. For example:

poc/grafana/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Note that the hostname prometheus in the URL http://prometheus:9090 only resolves when Grafana and Prometheus run on the same Docker compose network. Otherwise, one needs to use the network address of the Prometheus server.
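
For a deployment in which Grafana and Prometheus do not share a Docker compose network, the same provisioning file could point at the server's address instead; a minimal sketch, where the host placeholder is illustrative:

datasources:
  - name: prometheus
    type: prometheus
    access: proxy
    url: http://<prometheus_host>:9090
    isDefault: true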

dashboards/dashboards.yml specifies the dashboard configuration in Grafana. For example:

poc/grafana/dashboards/dashboards.yml
apiVersion: 1

providers:
  # A unique provider name: <string> (required)
  - name: '<string>'
    # Org id: <int> (default to 1)
    orgId: 1
    # Name of the dashboard folder: <string>
    folder: ''
    # Folder UID: <string> (automatically generated if not specified)
    folderUid: ''
    # Provider type: <string> (default to 'file')
    type: file
    # Disable dashboard deletion: <bool>
    disableDeletion: false
    # How often Grafana scans for dashboard updates: <int>
    updateIntervalSeconds: 30
    # Allow updating provisioned dashboards from the UI: <bool>
    allowUiUpdates: false
    options:
      # Path to dashboard files on disk: <string> (required when using 'file')
      path: /etc/grafana/provisioning/dashboards
      # Use folder names from filesystem to create folders in Grafana: <bool>
      foldersFromFilesStructure: true

dashboards/hathor-core/hathor_fullnodes.json is the source code for the dashboard created by Hathor Labs to facilitate day-to-day operation of full nodes. To get this dashboard, refer to Hathor full node public dashboard.

Alertmanager

The alertmanager directory contains two configuration files for running Alertmanager:

  • alertmanager.yml
  • config/template_sns.tmpl

alertmanager.yml is required and specifies the overall configuration for the execution of Alertmanager. For example:

poc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: hathor-alert-manager-sns
  group_by: ['alertname', 'application', 'severity', 'environment']
  group_wait: 5s
  group_interval: 5m
  repeat_interval: 30m
receivers:
  - name: hathor-alert-manager-sns
    sns_configs:
      - api_url: https://sns.us-east-1.amazonaws.com
        sigv4:
          region: us-east-1
          access_key: <string>
          secret_key: <string>
        topic_arn: arn:aws:sns:us-east-1:1234567890:your-sns-topic-name
        subject: '{{ template "sns.hathor.subject" . }}'
        message: '{{ template "sns.hathor.text" . }}'
        attributes:
          application: '{{ or .CommonLabels.application "-" }}'
          chart: '{{ or .CommonAnnotations.link "-" }}'
          runbook: '{{ or .CommonAnnotations.runbook "-" }}'
          severity: '{{ or .CommonLabels.severity "-" }}'
          source: prometheus
          status: '{{ or .Status "-" }}'
          title: '{{ or .CommonLabels.alertname "-" }}'
templates:
  - /etc/alertmanager/config/*.tmpl
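
The route block above sends every alert to the single SNS receiver. Alertmanager also supports splitting alerts across receivers with child routes. The following is a minimal sketch, not part of the PoC files, assuming a second, hypothetical receiver named hathor-alert-manager-sns-oncall were defined under receivers:

route:
  receiver: hathor-alert-manager-sns
  group_by: ['alertname', 'application', 'severity', 'environment']
  routes:
    # Hypothetical child route: alerts labeled severity=major go to a
    # separate receiver, e.g. an SNS topic that pages the on-call engineer
    - matchers:
        - 'severity = "major"'
      receiver: hathor-alert-manager-sns-oncall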

Note that, in addition to this configuration file, each notification receiver must also be configured. In this proof of concept, the only receiver is AWS SNS.

config/template_sns.tmpl is optional and specifies a template for notification messages dispatched by Alertmanager to its defined receivers. For example:

poc/alertmanager/config/template_sns.tmpl

# template_sns.tmpl

{{ define "sns.hathor.subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}{{ end }}

{{ define "sns.hathor.text" }}
{{- $root := . -}}
{{ template "sns.hathor.subject" . }}
{{ range .Alerts }}
*Severity:* `{{ .Labels.severity }}`
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Chart:* {{ .Annotations.link }}
*Runbook:* {{ .Annotations.runbook }}
*Details:*
{{ range .Labels.SortedPairs }} - *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
{{ end }}

What's next?