Monitoring system — PoC
Introduction
This article presents a proof of concept of the monitoring solution suggested in the article Hathor full node monitoring. It aims to serve as an example for developers implementing a Prometheus-based monitoring system for their full nodes.
This article is intended to be used in conjunction with two other articles, namely:
- Hathor full node monitoring: the design of the monitoring solution that this proof of concept implements.
- How to Install a Monitoring System for Hathor Full Node: step-by-step instructions to deploy this solution.
Overview
This proof of concept comprises the following components:
- Hathor full node
- Node exporter
- Prometheus
- Grafana
- Alertmanager
- Docker Compose
Docker Compose is used to orchestrate containers for the other five components. Thus, this proof of concept consists of a set of configuration files, organized in the following directory structure:
poc/
├── prometheus/
│ ├── prometheus.yml
│ └── alerting_rules.yml
├── grafana/
│ ├── dashboards/
│ │ ├── hathor-core/
│ │ │ └── hathor_fullnodes.json
│ │ └── dashboards.yml
│ └── datasources/
│ └── prometheus.yml
├── alertmanager/
│ ├── config/
│ │ └── template_sns.tmpl
│ └── alertmanager.yml
└── docker-compose.yml
The following sections discuss the configuration files for Docker Compose, Prometheus, Grafana, and Alertmanager. Hathor full node and Node exporter do not require separate configuration files, as all their execution parameters are already set within Docker Compose. To download the source code of this proof of concept (that is, the configuration files presented throughout this article), use hathor-monitoring-system-poc.zip.
<Placeholders>: in the code samples of this article, as in all Hathor docs, <placeholders> are always wrapped by angle brackets < >. You shall interpret or replace a <placeholder> with a value according to the context. Whenever replacing a <placeholder> like this one with a value, do not wrap the value with quotes. Quotes, when necessary, will be indicated, wrapping the "<placeholder>" like this one.
Note that many configurations depend on the deployment environment. In this proof of concept, the case of AWS is considered.
Docker Compose
The docker-compose.yml file is located in the root of the poc directory. This file is the only configuration file required to orchestrate containers for all five components of the proof of concept. For example:
poc/docker-compose.yml
services:
  hathor-core:
    image: hathornetwork/hathor-core
    command: run_node
    ports:
      - "8080:8080"
    volumes:
      - <absolute_path_hathor_full_node>/data:/data
    environment:
      - HATHOR_TESTNET=true
      - HATHOR_STATUS=8080
      - HATHOR_WALLET_INDEX=true
      - HATHOR_CACHE=true
      - HATHOR_CACHE_SIZE=100000
      - HATHOR_DATA=/data
      - HATHOR_PROMETHEUS=true
      - "HATHOR_PROMETHEUS_PREFIX=hathor_core:"

  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
      - '--collector.textfile.directory=/host/data/prometheus'
    ports:
      - "9100:9100"
    pid: host
    restart: unless-stopped
    volumes:
      - <absolute_path_hathor_full_node>:/host:ro,rslave
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus:/etc/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    environment:
      - AWS_ACCESS_KEY_ID=<string>
      - AWS_SECRET_ACCESS_KEY=<string>
    ports:
      - '9090:9090'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    ports:
      - '3000:3000'
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=<admin_password>
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - '9093:9093'
    networks:
      - monitoring

networks:
  monitoring:
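With docker-compose.yml in place, the whole stack can be started from the poc directory and each component checked on its mapped port. The following commands are a minimal sketch, assuming Docker with the Compose plugin is installed and that each component exposes its default API paths:

docker compose up -d

curl http://localhost:8080/v1a/status   # Hathor full node status API
curl http://localhost:9100/metrics      # Node exporter metrics
curl http://localhost:9090/-/ready      # Prometheus readiness probe
curl http://localhost:3000/api/health   # Grafana health check
curl http://localhost:9093/-/ready      # Alertmanager readiness probe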
Prometheus
The prometheus directory contains the two configuration files required for running Prometheus:
- prometheus.yml
- alerting_rules.yml

prometheus.yml specifies the overall configuration for the execution of Prometheus. For example:
poc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/alerting_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'local-nodes'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          # These labels are used by Grafana
          network: testnet
          instance_name: hathor-full-node-testnet-local
  - job_name: 'aws-nodes'
    ec2_sd_configs:
      - region: <string>
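Before (re)starting the stack with an edited configuration, the file can be validated with promtool, the checking tool shipped in the prom/prometheus image. A minimal sketch, assuming it is run from the poc directory:

docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus:/etc/prometheus" \
  prom/prometheus check config /etc/prometheus/prometheus.yml

Besides the main configuration, this command also validates the rule files referenced under rule_files.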
alerting_rules.yml specifies the alarms and alerts to be created in Prometheus. For example:
poc/prometheus/alerting_rules.yml
groups:
  - name: hathor-full-nodes-blocks.rules
    rules:
      - alert: FullNodeBlocksWarning
        expr: increase(hathor_core:blocks{job='aws-nodes'}[5m]) < 1
        for: 20m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: Fullnode blocks not syncing - {{ $labels.instance_name }}
          description: "The Fullnode has not received any blocks for 25 minutes \n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-disk.rules
    rules:
      - alert: FullNodeUsedDiskSpaceWarning
        expr: ((node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'} - node_filesystem_avail_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'}) / (node_filesystem_size_bytes{job='aws-nodes',device=~'/dev/.*', mountpoint!~'/snap/.*'})) * 100 > 85
        for: 10m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: FullNode used disk space - {{ $labels.instance_name }}
          description: "More than 85% of the disk space has been used\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-cpu.rules
    rules:
      - alert: FullNodeCpuUsageWarning
        # The offset is used to ignore the first 6h of metrics in recently created full-nodes, since their initial syncing process could use a lot of CPU
        expr: 1 - rate(node_cpu_seconds_total{mode='idle',job='aws-nodes'}[5m]) > 0.85 and ON(instance) (node_cpu_seconds_total{mode='idle',job='aws-nodes'} offset 6h) > 0
        for: 15m
        labels:
          application: hathor-core
          severity: warning
        annotations:
          summary: FullNode high cpu usage - {{ $labels.instance_name }}
          description: "The cpu usage is higher than 85%\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
  - name: hathor-full-nodes-memory.rules
    rules:
      - alert: FullNodeMemoryUsageMajor
        expr: ((node_memory_MemTotal_bytes{job='aws-nodes'} - node_memory_MemFree_bytes{job='aws-nodes'} - node_memory_Cached_bytes{job='aws-nodes'} - node_memory_Buffers_bytes{job='aws-nodes'} - node_memory_Slab_bytes{job='aws-nodes'}) / (node_memory_MemTotal_bytes{job='aws-nodes'} )) * 100 > 95
        for: 5m
        labels:
          application: hathor-core
          severity: major
        annotations:
          summary: FullNode memory usage too high - {{ $labels.instance_name }}
          description: "The memory usage is higher than 95%\n VALUE = {{ $value }}\n"
          link: https://your-grafana-domain/explore?left=%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22expr%22:%22hathor_core:connected_peers%7Bjob%3D%27aws-nodes%27%7D%20%3D%3D%200%22,%22format%22:%22time_series%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22interval%22:%22%22,%22editorMode%22:%22code%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1
          runbook: you can add here a link to a runbook with instructions on how to fix this issue
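The alerting rules can also be checked in isolation with promtool. A minimal sketch, again run from the poc directory:

docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus:/etc/prometheus" \
  prom/prometheus check rules /etc/prometheus/alerting_rules.yml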
Grafana
The grafana directory contains the three configuration files required for running Grafana:
- datasources/prometheus.yml
- dashboards/dashboards.yml
- dashboards/hathor-core/hathor_fullnodes.json

datasources/prometheus.yml specifies how Grafana connects to Prometheus as a data source. For example:
poc/grafana/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
Note that the prometheus hostname in the http://prometheus:9090 URL can only be used with Docker Compose. Otherwise, one needs to use the network address of the Prometheus server.
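Once Grafana is up, one way to confirm that the data source was provisioned is to query the Grafana HTTP API, authenticating as the admin user with the <admin_password> set in docker-compose.yml. A minimal sketch:

curl -s -u admin:<admin_password> http://localhost:3000/api/datasources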
dashboards/dashboards.yml specifies the dashboard configuration in Grafana. For example:
poc/grafana/dashboards/dashboards.yml
apiVersion: 1

providers:
  # A unique provider name: <string> (required)
  - name: '<string>'
    # Org id: <int> (default to 1)
    orgId: 1
    # Name of the dashboard folder: <string>
    folder: ''
    # Folder UID: <string> (automatically generated if not specified)
    folderUid: ''
    # Provider type: <string> (default to 'file')
    type: file
    # Disable dashboard deletion: <bool>
    disableDeletion: false
    # How often Grafana scans for dashboard updates: <int>
    updateIntervalSeconds: 30
    # Allow updating provisioned dashboards from the UI: <bool>
    allowUiUpdates: false
    options:
      # Path to dashboard files on disk: <string> (required when using 'file')
      path: /etc/grafana/provisioning/dashboards
      # Use folder names from filesystem to create folders in Grafana: <bool>
      foldersFromFilesStructure: true
dashboards/hathor-core/hathor_fullnodes.json is the source code for the dashboard created by Hathor Labs to facilitate day-to-day operation of full nodes. To get this dashboard, refer to Hathor full node public dashboard.
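Similarly, after Grafana starts, the search endpoint of its HTTP API can be used to confirm that the provisioned dashboard was loaded. A minimal sketch:

curl -s -u admin:<admin_password> http://localhost:3000/api/search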
Alertmanager
The alertmanager directory contains the two configuration files for running Alertmanager:
- alertmanager.yml
- config/template_sns.tmpl

alertmanager.yml is required and specifies the overall configuration for the execution of Alertmanager. For example:
poc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: hathor-alert-manager-sns
  group_by: ['alertname', 'application', 'severity', 'environment']
  group_wait: 5s
  group_interval: 5m
  repeat_interval: 30m

receivers:
  - name: hathor-alert-manager-sns
    sns_configs:
      - api_url: https://sns.us-east-1.amazonaws.com
        sigv4:
          region: us-east-1
          access_key: <string>
          secret_key: <string>
        topic_arn: arn:aws:sns:us-east-1:1234567890:your-sns-topic-name
        subject: '{{ template "sns.hathor.subject" . }}'
        message: '{{ template "sns.hathor.text" . }}'
        attributes:
          application: '{{ or .CommonLabels.application "-" }}'
          chart: '{{ or .CommonAnnotations.link "-" }}'
          runbook: '{{ or .CommonAnnotations.runbook "-" }}'
          severity: '{{ or .CommonLabels.severity "-" }}'
          source: prometheus
          status: '{{ or .Status "-" }}'
          title: '{{ or .CommonLabels.alertname "-" }}'

templates:
  - /etc/alertmanager/config/*.tmpl
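The Alertmanager configuration, including the templates it references, can be validated with amtool, which ships in the prom/alertmanager image. A minimal sketch, assuming it is run from the poc directory:

docker run --rm --entrypoint amtool \
  -v "$(pwd)/alertmanager:/etc/alertmanager" \
  prom/alertmanager check-config /etc/alertmanager/alertmanager.yml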
Note that beyond editing this configuration file, it is also necessary to set up each notification receiver. In this proof of concept, the only receiver is AWS SNS.
config/template_sns.tmpl is optional and specifies a template for the notification messages dispatched by Alertmanager to its defined receivers. For example:
poc/alertmanager/config/template_sns.tmpl
# template_sns.tmpl
{{ define "sns.hathor.subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}{{ end }}
{{ define "sns.hathor.text" }}
{{- $root := . -}}
{{ template "sns.hathor.subject" . }}
{{ range .Alerts }}
*Severity:* `{{ .Labels.severity }}`
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Chart:* {{ .Annotations.link }}
*Runbook:* {{ .Annotations.runbook }}
*Details:*
{{ range .Labels.SortedPairs }} - *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
{{ end }}
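To exercise the notification pipeline end to end without waiting for a real alert, a synthetic alert can be posted directly to the Alertmanager API. A minimal sketch, using hypothetical label values chosen to match the labels and attributes above:

curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","application":"hathor-core","severity":"warning"},"annotations":{"summary":"Test notification","description":"Manual test of the SNS receiver"}}]'

If the SNS receiver is correctly configured, a message rendered with the templates above should reach the subscribed endpoint shortly after group_wait elapses.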
What's next?
- Hathor full node monitoring: to know how to implement a monitoring solution for your full node(s).
- Hathor full node metrics: reference material regarding the metrics tracked by the full node.
- How to Install a Monitoring System for Hathor Full Node: a step-by-step guide to install a Prometheus-based monitoring system.
- Hathor full node pathway: to know how to operate a full node.