
Install monitoring system for full node(s)

Goal

This article guides you through installing a Prometheus-based monitoring system for your Hathor full node(s).

To know more about how this system works, see Hathor full node monitoring.

Requirement

Hathor core ≥ v0.58.0

Step-by-step

  1. Set up Hathor full node.
  2. Set up Node exporter.
  3. Set up Prometheus.
  4. Set up Grafana.
  5. Set up Alertmanager.
  6. Define alerting rules.

Step 1: set up Hathor full node

Metric generation is not part of the default configuration of Hathor full node and must be enabled at startup.

  1. Start a shell session.
  2. Change the working directory to where you installed your full node.
  3. If you installed your full node from source code, start it using the --prometheus option. For example:
poetry run hathor-cli run_node --status 8080 --testnet --data ../data --wallet-index --prometheus

Additionally, you can use the --prometheus-prefix option to add a prefix to the name of each metric generated by the full node. This is helpful if you are going to monitor other systems alongside your full node, and necessary if you are going to monitor multiple full node instances.

  4. (Optional) If you are going to monitor multiple systems, add --prometheus-prefix="hathor_" to the previous command, replacing hathor with a prefix that best suits your setup. For example:
poetry run hathor-cli run_node --status 8080 --testnet --data ../data --wallet-index --prometheus --prometheus-prefix="hathor_"

As a best practice, Prometheus documentation recommends using the name of the system as a prefix. To know more about it, refer to Prometheus docs — Metric and label naming. As a result, "hathor_" and "hathor_full_node_" are good choices for the argument of the --prometheus-prefix option.

  5. (Optional) Now, if you are going to monitor multiple full node instances, choose a value for the argument of --prometheus-prefix that distinguishes which one the metric refers to. For example: --prometheus-prefix="hathor_full_node_1_" or --prometheus-prefix="node_1_", substituting the numeral with the internal ID you assign to each full node instance.

Your full node will now generate metrics and save them in the hathor.prom file, stored in the data/prometheus directory.
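
To confirm that metrics are being generated, you can inspect this file directly. A quick check (the path below assumes --data ../data, as in the startup command above):
tail ../data/prometheus/hathor.prom  # path assumes --data ../data; adjust to your own data directory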

  6. Repeat the process for each full node instance you operate.

Step 2: set up Node exporter

Node exporter will be configured to collect metrics from all files with the .prom extension in data/prometheus and expose them to be scraped by Prometheus.

  1. Install Node exporter on the same host as your Hathor full node. To know how to do it, refer to the Node Exporter Github repository. If you have multiple instances of Hathor full node, each host must have its own instance of Node exporter.

  2. Start a shell session.

  3. Change the working directory to where you installed Node exporter.

  4. Start Node exporter using the --collector.textfile.directory=<absolute_path_hathor_full_node>/data/prometheus option (or equivalent environment variable if using Docker compose), replacing the <absolute_path_hathor_full_node> placeholder with the absolute path of the directory where you installed your full node.

Node exporter will now collect metrics from both the full node and its host. By default, it will listen on port 9100 and expose metrics at the /metrics endpoint.
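
For reference, a minimal startup command might look like the following sketch, reusing the <absolute_path_hathor_full_node> placeholder described above:
./node_exporter --collector.textfile.directory=<absolute_path_hathor_full_node>/data/prometheus  # run from the Node exporter installation directory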

  5. Verify that metrics are available for exporting:
curl http://localhost:9100/metrics

The expected result is to receive metrics from both the full node and its host.
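
If you set a metric prefix in step 1, you can also filter the output to confirm that the full node metrics are present. A quick check, assuming the "hathor_" prefix used earlier:
curl -s http://localhost:9100/metrics | grep hathor_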

  6. Repeat the process for each full node instance you operate.

Step 3: set up Prometheus

Prometheus will be configured to scrape the full node metrics exposed by each Node exporter instance.

  1. Install Prometheus on a different host from your full node(s). To know how to do it, refer to Prometheus docs — installation and Prometheus docs — getting started.

Now, you need to configure each Node exporter as a target for Prometheus to scrape.

  2. Start a shell session.
  3. Change the working directory to where you installed Prometheus.
  4. Open Prometheus configuration file (typically prometheus.yml).
  5. Within scrape_configs, write the following configuration, replacing the <full_node> placeholder with the full node's network address:
<absolute_path_prometheus_installation>/prometheus.yml
...
scrape_configs:
  - job_name: 'hathor_full_nodes'
    static_configs:
      - targets: ['<full_node>:9100']
        labels:
          network: 'my_network'
          group: 'production'
...

To know how to do it, refer to Prometheus docs — configuration to monitor targets.

  6. Repeat the previous step for each full node. For example:
<absolute_path_prometheus_installation>/prometheus.yml
...
scrape_configs:
  - job_name: 'hathor_full_nodes'
    static_configs:
      - targets: ['<full_node_1>:9100', '<full_node_2>:9100', ..., '<full_node_n>:9100']
        labels:
          network: 'my_network'
          group: 'production'
...
  7. (Optional) Alternatively, instead of statically configuring your targets (namely, each of your full nodes), they can be dynamically discovered using one of the supported service discovery mechanisms. To know more about it, refer to Prometheus docs — scrape configuration.

  8. Restart Prometheus as usual.

Prometheus will scrape metrics from each existing instance of Node exporter. It exposes all aggregated data through its HTTP API and listens on port 9090 by default.

  9. Use Prometheus' expression browser to verify that Prometheus now has information about time series exposed by all Node exporter instances.
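
If you prefer the command line, the same check can be made against Prometheus' HTTP API. For example, assuming Prometheus is reachable at localhost:
# List the state of every configured scrape target
curl -s http://localhost:9090/api/v1/targets
# Query the 'up' metric; every Node exporter target should report the value 1
curl -s 'http://localhost:9090/api/v1/query?query=up'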

Step 4: set up Grafana

Grafana will be configured to visualize the metrics collected from your full nodes by Prometheus.

  1. Install Grafana on a different host from your full node(s). To know how to do it, refer to Grafana docs — installation.

  2. Configure Prometheus as a data source for Grafana. To know how to do it, refer to Grafana docs — configure Prometheus.
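
If you manage Grafana through provisioning files rather than the web UI, a data source definition might look like the following sketch (the provisioning path and the Prometheus address are assumptions; adjust them to your setup):
/etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Sketch: point this at the host where you installed Prometheus in step 3
    url: http://<prometheus_host>:9090
    isDefault: true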

Hathor Labs provides a public dashboard already preconfigured for monitoring full nodes. This dashboard was conceived to facilitate day-to-day operation of full nodes.

  3. Import Hathor full node public dashboard. To know how to do it, refer to Grafana docs — import dashboards. If your full node(s) run with the --prometheus-prefix option, Grafana will prompt you to enter a value for Hathor CLI Prefix. Use the same value you previously defined in substep 1.4 or 1.5 of this guide.

Step 5: set up Alertmanager

Alertmanager will be configured to receive alerts from Prometheus, handle these alerts, and send notifications to you and your team whenever any of your full nodes becomes (or is becoming) unhealthy.

  1. Install Alertmanager. To know how to do it, refer to Alertmanager Github repository.

Now you must configure Prometheus to send alerts to Alertmanager, which listens on port 9093 by default.

  2. Start a shell session on Prometheus' host.
  3. Change the working directory to where you installed Prometheus.
  4. Open Prometheus configuration file (typically prometheus.yml).
  5. Within alerting, write the following configuration, replacing the <alertmanager> placeholder with Alertmanager's network address:
<absolute_path_prometheus_installation>/prometheus.yml
...
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['<alertmanager>:9093']
...
  6. (Optional) Alternatively, instead of statically configuring Alertmanager in Prometheus, you can use service discovery to locate the Alertmanager instance. To know how to do it, refer to Prometheus docs — Alertmanager configuration. An example for a monitoring system deployed on AWS:
<absolute_path_prometheus_installation>/prometheus.yml
...
alerting:
  alertmanagers:
    - static_configs:
        - targets: [] # Empty targets because we're using service discovery for Alertmanager
      # EC2 service discovery configuration
      ec2_sd_configs:
        - refresh_interval: 30s
          region: <your_aws_region>
          access_key: <your_aws_access_key> # Could be replaced by an AWS profile or a role attached to the Prometheus instance
          secret_key: <your_aws_secret_key>
          port: 9093 # Assuming Alertmanager is running on port 9093
          filters:
            - name: 'tag:instance'
              values: ['<your_alertmanager_instance_tag_value>']
...
  7. Restart Prometheus as usual.

Now, you need to configure how Alertmanager should handle alerts and send notifications — i.e., (1) notification receivers, (2) notification routing, and (3) inhibition rules:

  • Notification receivers (1) are the systems to which Alertmanager will dispatch notifications.
  • Notification routing (2) defines the conditions under which notifications are sent to the receivers.
  • Inhibition rules (3) define how alerts should interact with each other to ensure only useful notifications are dispatched.
  8. Configure the systems you want to have as notification receivers. Regarding notification dispatching, Alertmanager supports SMTP, webhooks and others, and integrates with multiple systems such as Discord, Opsgenie, Slack, Telegram, SNS, etc. To know how to do it, refer to Prometheus docs — Alertmanager configuration — receiver integration.

  9. Configure notification routing: to whom, when, and how notifications should be dispatched. To know how to do it, refer to Prometheus docs — Alertmanager configuration — routing.

  10. Configure the inhibition rules. To know how to do it, refer to Prometheus docs — Alertmanager configuration — inhibition rules.
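
For illustration, an inhibition rule that mutes 'warning' notifications while a 'major' alert with the same alert name and instance is firing might look like the following sketch (the severity values match the alerting rules defined later in this guide; adjust the labels to your setup):
<absolute_path_alertmanager_installation>/alertmanager.yml
...
inhibit_rules:
  # Sketch: suppress 'warning' alerts whenever a 'major' alert is firing for the same alertname and instance
  - source_matchers:
      - severity="major"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
...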

  11. (Optional) Configure a template to construct your notifications. Templates allow for the customization of messages within alert notifications. To know more about it, refer to Prometheus docs — Alertmanager — Notification template reference and Prometheus docs — Alertmanager — Notification template examples.

The following snippet is an example of how your Alertmanager configuration file (typically alertmanager.yml) might look after performing the previous substeps:

<absolute_path_alertmanager_installation>/alertmanager.yml
...
global:
  resolve_timeout: 5m
...
route:
  receiver: hathor-alert-manager-sns
  group_by: ['alertname', 'application', 'severity', 'environment']
  group_wait: 5s
  group_interval: 5m
  repeat_interval: 30m
receivers:
  - name: hathor-alert-manager-sns
    sns_configs:
      - api_url: https://sns.us-east-1.amazonaws.com
        sigv4:
          region: us-east-1
          access_key: <your_aws_access_key>
          secret_key: <your_aws_secret_key>
        topic_arn: arn:aws:sns:us-east-1:1234567890:your-sns-topic-name
        subject: '{{ template "sns.hathor.subject" . }}'
        message: '{{ template "sns.hathor.text" . }}'
        attributes:
          application: '{{ or .CommonLabels.application "-" }}'
          chart: '{{ or .CommonAnnotations.link "-" }}'
          runbook: '{{ or .CommonAnnotations.runbook "-" }}'
          severity: '{{ or .CommonLabels.severity "-" }}'
          status: '{{ or .Status "-" }}'
          title: '{{ or .CommonLabels.alertname "-" }}'
templates:
  - /etc/alertmanager/config/*.tmpl
...

In this example, all alerts are dispatched to an AWS SNS topic, where customized tools can then process them further.

The attributes block serves as a flexible metadata object where you can include any information to be dispatched along with the notification to the receivers. We've added attributes that could be useful for further processing, such as the application name, chart link, runbook link, severity, status, and title. Note that the usage of .CommonLabels and .CommonAnnotations is only possible because group_by ensures that grouped alerts share the same labels and annotations.

The templates section at the end of the file tells Alertmanager where to find notification templates, which are defined in separate files. An example of a notification template for SNS:

sns.tmpl
{{ define "sns.hathor.subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}{{ end }}

{{ define "sns.hathor.text" }}
{{- $root := . -}}
{{ template "sns.hathor.subject" . }}
{{ range .Alerts }}
*Severity:* `{{ .Labels.severity }}`
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Chart:* {{ .Annotations.link }}
*Runbook:* {{ .Annotations.runbook }}
*Details:*
{{ range .Labels.SortedPairs }} - *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
{{ end }}

This template file defines two templates: sns.hathor.subject and sns.hathor.text:

  • sns.hathor.subject: if the alert group is in the firing state, it prefixes the subject with [FIRING:3], where '3' represents the number of firing alerts included in the notification. It then appends the name of the alert.
  • sns.hathor.text: it first renders the sns.hathor.subject template. Then, for each alert, it adds a set of attributes derived from the alert's labels and annotations. Details includes all labels of the alert, sorted by their names.
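
Before (re)starting Alertmanager with this configuration, you can validate the configuration file with amtool, which ships with Alertmanager. For example (adjust the path to your own alertmanager.yml):
./amtool check-config <absolute_path_alertmanager_installation>/alertmanager.yml  # run from the Alertmanager installation directory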

Step 6: define alerting rules

Finally, you must define alerting rules. An alerting rule describes a condition under which a full node is (or is becoming) unhealthy and specifies how alerts should be issued in response. Alerting rules are defined in a dedicated YAML file (e.g., alerting_rules.yml) and referenced in the Prometheus configuration file (typically prometheus.yml). To know how to do it, refer to Prometheus docs — alerting rules.

  1. Start a shell session on Prometheus' host.
  2. Change the working directory to where you installed Prometheus.
  3. Create alerting_rules.yml.
  4. Open Prometheus configuration file (typically prometheus.yml).
  5. Within rule_files, write the following configuration, replacing the <absolute_path_prometheus> placeholder with the absolute path to alerting_rules.yml:
<absolute_path_prometheus_installation>/prometheus.yml
global:
...
rule_files:
  - <absolute_path_prometheus>/alerting_rules.yml
...
  6. Define alerting rules for high CPU usage, and write them in alerting_rules.yml. For example:
<absolute_path_prometheus_installation>/alerting_rules.yml
groups:
  - name: hathor-full-nodes-cpu.rules
    rules:
      - alert: FullNodeCpuUsageWarning
        # Offset used to ignore the first 6 hours of metrics on recently deployed full nodes, since initial synchronization consumes a lot of CPU.
        expr: 1 - rate(node_cpu_seconds_total{mode='idle',job='aws-nodes'}[5m]) > 0.85 and ON(instance) (node_cpu_seconds_total{mode='idle',job='aws-nodes'} offset 6h) > 0
        for: 15m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: FullNode high cpu usage - {{ $labels.instance_name }}
          description: "The cpu usage is higher than 85%\n VALUE = {{ $value }}\n"
          link: <URL_to_specific_grafana_panel>
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  7. Define alerting rules for high memory usage, and write them in alerting_rules.yml. For example:
<absolute_path_prometheus_installation>/alerting_rules.yml
groups:
  - name: hathor-full-nodes-cpu.rules
    ...
  - name: hathor-full-nodes-memory.rules
    rules:
      - alert: FullNodeMemoryUsageMajor
        expr: ((node_memory_MemTotal_bytes{job='aws-nodes'} - node_memory_MemFree_bytes{job='aws-nodes'} - node_memory_Cached_bytes{job='aws-nodes'} - node_memory_Buffers_bytes{job='aws-nodes'} - node_memory_Slab_bytes{job='aws-nodes'}) / (node_memory_MemTotal_bytes{job='aws-nodes'})) * 100 > 95
        for: 5m
        labels:
          application: hathor-core
          severity: major
        annotations:
          summary: FullNode memory usage too high - {{ $labels.instance_name }}
          description: "The memory usage is higher than 95%\n VALUE = {{ $value }}\n"
          link: <URL_to_specific_grafana_panel>
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  8. Define alerting rules for high disk usage, and write them in alerting_rules.yml. For example:
<absolute_path_prometheus_installation>/alerting_rules.yml
groups:
  - name: hathor-full-nodes-cpu.rules
    ...
  - name: hathor-full-nodes-memory.rules
    ...
  - name: hathor-full-nodes-disk.rules
    rules:
      - alert: FullNodeUsedDiskSpaceWarning
        expr: ((node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'} - node_filesystem_avail_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'}) / (node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'})) * 100 > 85
        for: 10m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: FullNode used disk space - {{ $labels.instance_name }}
          description: "More than 85% of the disk space has been used\n VALUE = {{ $value }}\n"
          link: <URL_to_specific_grafana_panel>
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  9. Define alerting rules for full node not syncing with its peers, and write them in alerting_rules.yml. For example:
<absolute_path_prometheus_installation>/alerting_rules.yml
groups:
  - name: hathor-full-nodes-cpu.rules
    ...
  - name: hathor-full-nodes-memory.rules
    ...
  - name: hathor-full-nodes-disk.rules
    ...
  - name: hathor-full-nodes-blocks.rules
    rules:
      - alert: FullNodeBlocksWarning
        expr: increase(hathor_core:blocks{job='aws-nodes'}[5m]) < 1
        for: 20m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: Fullnode blocks not syncing - {{ $labels.instance_name }}
          description: "The Fullnode has not received any blocks for 25 minutes \n VALUE = {{ $value }}\n"
          link: <URL_to_specific_grafana_panel>
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
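
Before restarting, you can validate both the main configuration and the rules file with promtool, which ships with Prometheus. For example (adjust the paths to your setup):
./promtool check config <absolute_path_prometheus_installation>/prometheus.yml
./promtool check rules <absolute_path_prometheus>/alerting_rules.yml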
  10. Restart Prometheus as usual.

It's possible to test an alerting rule by using the value of expr as a query in Grafana Explore.

  11. Test each of the defined alerting rules using Grafana Explore. To know how to do it, refer to Grafana docs — query management in explore.
note

We suggest you define multiple alerting rules for each of the four listed alarms — namely, CPU usage, memory usage, disk usage, and synchronization — with different levels of severity, such as 'warning', 'minor', and 'major'. For example, high disk usage may trigger 'warning' once 85% of disk usage is reached, 'minor' once 90% is reached, and 'major' when 95% is reached.
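
For example, a 'major' counterpart to the disk usage rule defined earlier might look like the following sketch (only the alert name, threshold, severity, and texts change):
<absolute_path_prometheus_installation>/alerting_rules.yml
groups:
  ...
  - name: hathor-full-nodes-disk.rules
    rules:
      ...
      - alert: FullNodeUsedDiskSpaceMajor
        expr: ((node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'} - node_filesystem_avail_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'}) / (node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'})) * 100 > 95
        for: 10m
        labels:
          application: hathor-core
          severity: major
        annotations:
          summary: FullNode used disk space critically high - {{ $labels.instance_name }}
          description: "More than 95% of the disk space has been used\n VALUE = {{ $value }}\n"
          link: <URL_to_specific_grafana_panel>
          runbook: you can add here a link to a runbook with instructions on how to fix this issue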

Task completed

You now have a complete Prometheus-based monitoring system, capable of monitoring all full nodes and any other systems you operate.

What's next?