Full node troubleshooting

Goal

This article will guide you to troubleshoot Hathor full node. Each of the following sections provides resolution to a problem you may encounter while operating and using it.

Installation failure: RocksDB

Situation

You are installing Hathor core from source code. You ran poetry install to install the project dependencies and received the following error message:

- Installing rocksdb (0.9.2 1f0ce6a): Failed

Cause

Poetry couldn't find a compatible version of RocksDB library for the project in your environment. Either it wasn't previously installed, or it's not accessible. Hathor core requires RocksDB library v9.

Solutions

First, make sure RocksDB library v9 is installed. If it is, make sure it's accessible to Poetry. If not, use your chosen package manager to install it (apt, brew, etc.).

Now, It's possible that no RocksDB library v9 package is available for your specific environment. For example, on Ubuntu 20.04 LTS, the latest available version of librocksdb-dev package in the official Ubuntu repository is v5.17, which is no longer compatible with hathor-core. Another issue may arise on macOS when using Homebrew package manager. Since it's not possible to choose which version of the package will be installed, Homebrew may provide a newer version in its official repository that is not yet supported by hathor-core.

If it’s not possible to obtain RocksDB library v9 via package manager, you’ll need to install it from source code. To do this, follow this procedure:

Start a shell session.
Change the working directory to where you want to install the RocksDB library.
Download RocksDB source code:

git clone -b 9.3.fb https://github.com/facebook/rocksdb.git

Change to RocksDB source code directory:

cd rocksdb

Compile RocksDB: (this may take a few minutes)

make -d shared_lib

Install RocksDB:

sudo make install-shared

Proceed with the usual installation process of Hathor core.

Fatal Python error: aborted

Situation

You ran a hathor-cli command, and at some point, the execution was aborted with the following error message:

Fatal Python error: Aborted.

Diagnosis

Inspect the traceback of the error and check whether the second-to-last call (the second entry in the stack trace) refers to file pycoin/ecdsa/native/openssl.py. If that’s the case, the Python module that failed to execute was pycoin.ecdsa.native.openssl.

Cause

pycoin tries to dynamically load the OpenSSL library but fails to find a compatible version. This error typically occurs when the OpenSSL library is installed via Homebrew on macOS.

Solution

If you installed OpenSSL via Homebrew, add to your shell configuration file (Bash or Zsh):

export PYCOIN_LIBCRYPTO_PATH="$(brew --prefix openssl)/lib/libcrypto.dylib"

Then, reload the shell configuration:

Bash
Zsh

source ~/.bashrc

source ~/.zshrc

Initialization failure: RocksDB

Situation

You ran the command to start the full node and received an error message related to RocksDB library.

Causes

During the installation process, you used poetry install to install the dependencies of hathor-core. One of them was the package rocksdb, which is a Python binding for the native C++ RocksDB library. Then, there are two possible causes for the error encountered:

The rocksdb dependency might not be correctly linking hathor-core to the native RocksDB library.
Alternatively, the dependency may be correct, but at runtime, the operating system's dynamic loader is failing to locate the RocksDB library.
Finally, it's possible that the rocksdb dependency is linking hathor-core to an incompatible version of the native RocksDB library.

Solutions

First, reinstall the rocksdb dependency:

poetry run pip uninstall -y rocksdb && poetry run pip install --no-binary :all: git+https://github.com/hathornetwork/python-rocksdb.git

Then, if the installation of rocksdb was successful but the error still persists, identify where the RocksDB library is installed and instruct the operating system's dynamic loader to locate it.

Initialization failure: unreliable storage

Situation

You ran the command to start the full node and received the following error message:

[error][hathor.manager] Error initializing node. The last time you executed your full node it wasn't stopped correctly. The storage is not reliable anymore and, because of that, so you must run a full verification or remove your storage and do a full sync.

Cause

It happens when the full node's database (specifically the ledger) is in an unreliable state. As explained by the message, this occurs when the full node does not shut down properly — e.g., the process was abruptly killed, or the machine turned off.

Solution

Restart your full node from an empty database. This solution entails the initial synchronization of the ledger from genesis block, which naturally takes a long time. However, one can expedite it by using a snapshot (of the ledger).

If you want to carry out this solution using a snapshot, see Slow initial synchronization.

If you want to carry out this solution without using a snapshot, follow this procedure:

Start a shell session.
Change the working directory to <absolute_path_hathor_full_node>/data, replacing the <absolute_path_hathor_full_node> placeholder with the directory where you installed your full node.
Remove all content of the data subdirectory.
Restart your full node as usual.

warning

If you are using the ledger events produced by your full node, your event-consuming client application will need to discard all events and reprocess them from scratch.

The ledger events generated by the full node are specific to a database instance. Therefore, whenever you restart from an empty database, the events logged in event-consuming client applications of the full node will no longer be valid.

For more on this, see the warning box in section Initialization failure: event queue disabled.

Initialization failure: event queue disabled

Situation

You ran the command to start the full node and received the following error message:

hathor.exception.InitializationError: Cannot start manager without event queue feature, as it was enabled in the previous startup. Either enable it, or use the reset-event-queue CLI command to remove all event-related data

Cause

The last time this full node ran, the ledger events producer was enabled. Now, you are trying to start it with this feature disabled. However, the full node does not initialize with the feature disabled as long as its database still contains previously generated ledger events.

Solutions

There are two possible solutions:

Enable ledger events producer.
Discard ledger events from database.

Solution (1) means restarting the full node with the ledger events producer enabled. In a production environment, this is almost always the preferred alternative. Even if your full node does not need to provide ledger events to clients now, it may be required in the future, and the additional resource consumption to keep the feature running is negligible.

Solution (2) means discarding the ledger events from the full node's database. It is mostly used for adjustments in development and test environments or in rare cases where the database has entered an unreliable state, and the ledger events have become unusable.

Solution 1: enable ledger events producer

See Enable ledger events producer at Full node key configurations.

Solution 2: discard ledger events from database

Start a shell session.
Change the working directory to where you installed your full node — namely, hathor-core and database, docker-compose.yml and database, or just the database.
Run the command to discard the ledger events from the database. For example:

Source code
Docker

cd hathor-core && poetry run hathor-cli reset-event-queue --data ../data

docker run \
  -v ${PWD}/data:/data \
  hathornetwork/hathor-core \
  reset-event-queue --data /data

Note that subcommand hathor-cli reset-event-queue requires option --data <your_full_node_database>, where placeholder <your_full_node_database> refers to the directory where you stored your full node's database — typically, data.

If the procedure is successful, you will receive the following informational message:

[info][hathor.cli.reset_event_queue] removing all events and related data...
[info][hathor.cli.reset_event_queue] reset complete

warning

If the full node discards its ledger events database and starts a new one, client applications must do the same. Otherwise, they will no longer receive new event notifications.

For example, suppose your full node operated with the ledger events producer enabled. During this period, several client applications connected to your full node’s async API and subscribed to be notified of these events. However, after a full node failure, it was necessary to sync from scratch with the network. As a result, the full node generated a new ledger event database from scratch. To receive event notifications again, client applications need to discard their event database and restart from scratch to sync with the full node’s new event database.

Peer discovery failure

Situation

Your full node is initializing but stalls during the peer discovery phase. It logs one or more of the following error messages:

[error][hathor.p2p.peer_discovery.dns] errback extra={'dns_seed_lookup_text', 'alpha.nano-testnet.hathor.network'} result=<twisted.python.failure.Failure builtins.OSError: [Errno 65] No route to host>

Cause

The full node is failing to perform the DNS query. This is a known issue in Hathor core that occurs only in very specific environments, typically related to OS and home LAN configurations of personal machines.

The DNS provides a set of peer IPs that serves as the seed for peer discovery. Without this seed, the full node is unable to connect to the network, as it does not know any of its peers.

Workaround

Until this issue is resolved, you can manually work around it by querying the network's DNS and providing the retrieved IPs to Hathor core as seed for peer discovery. To carry out this workaround, follow this procedure:

Select the DNS corresponding to the Hathor Network instance you want to connect to:
- Mainnet: mainnet.hathor.network
- Testnet: golf.testnet.hathor.network
- Nano-testnet: alpha.nano-testnet.hathor.network
Start a shell session.

Query the DNS of the selected network. For example:

dig TXT golf.testnet.hathor.network

In ANSWER SECTION you will obtain the IPs to be used. For example:

...
;; ANSWER SECTION:
golf.testnet.hathor.network. 60	IN	TXT	"tcp://18.156.174.211:40403/?id=6d6d72156f20d294c6677a8963ebe70df66b5beaf12773c16de250f8275fb6c5"
golf.testnet.hathor.network. 60	IN	TXT	"tcp://18.199.240.217:40403/?id=e4466f8e05e93dc7b077af3807830bee296936772033b73ee32da59e5400d8fd"
golf.testnet.hathor.network. 60	IN	TXT	"tcp://34.230.30.110:40403/?id=ffcc778abd0cf1be33062bbdaa48f9909e2e5a2947390efc72070c38c0505e69"
...

Restart your full node using one or more times the --bootstrap option/environment variable, once for each IP you want to provide as seed. For example:

Source code
Docker container
Docker compose

poetry run hathor-cli run_node --status 8080 --testnet --data ../data --wallet-index --bootstrap tcp://18.156.174.211:40403 --bootstrap tcp://18.199.240.217:40403 --bootstrap tcp://34.230.30.110:40403

docker run \
  -it -p 8080:8080 -v <absolute_path_hathor_full_node>/data:/data \
  hathornetwork/hathor-core \
  run_node --status 8080 --testnet --data /data --wallet-index --bootstrap tcp://18.156.174.211:40403 --bootstrap tcp://18.199.240.217:40403 --bootstrap tcp://34.230.30.110:40403

<absolute_path_hathor_installation>/docker-compose.yml
services:
  hathor-core:
    image: hathornetwork/hathor-core
    command: run_node
    ports:
      - "8080:8080"
      - "8081:8081"
    volumes:
      - <absolute_path_hathor_full_node>/data:/data
    environment:
      - HATHOR_STATUS=8080
      - HATHOR_STRATUM=8081
      - HATHOR_TESTNET=true
      - HATHOR_DATA=/data
      - HATHOR_WALLET_INDEX=true
      - HATHOR_CACHE=true
      - HATHOR_CACHE_SIZE=100000
      - HATHOR_BOOTSTRAP=tcp://18.156.174.211:40403
      - HATHOR_BOOTSTRAP=tcp://18.199.240.217:40403
      - HATHOR_BOOTSTRAP=tcp://34.230.30.110:40403
...

Note that providing a single IP is sufficient as a seed for peer discovery, while multiple IPs ensure that the full node can successfully find an available peer. Additionally, option/environment variable --bootstrap must receive exactly one argument per use. Therefore, you must append it as many times as the number of IPs you want to provide.

tip

Instead of manually querying the DNS, you can embed it directly into the command used to start your full node. For example:

HATHOR_DNS=golf.testnet.hathor.network; poetry run hathor-cli run_node --status 8080 --testnet --data ../data --wallet-index $(dig TXT $HATHOR_DNS +short | sed 's/"/--bootstrap /' | sed 's/"//')

Unable to connect to mainnet

Situation

You have just started a full node, and it is attempting to connect to its peers in mainnet. It then begins logging one or more of the following warning messages:

[warning][hathor.p2p.protocol] remote error payload=Blocked (by <peer_id_of_your_full_node>). Get in touch with Hathor team. peer_id=None remote=<IP_of_some_peer>:40403

Diagnosis

Send an HTTP API request to check the status of the full node. For example:

curl -X GET http://localhost:8080/v1a/status/ | jq .connections

In the API response, look for the connections object. If its properties connected_peers, handshaking_peers, and connecting_peers, all have empty arrays, it means your full node is unable to connect to any other peer (which means it is not connected to the network):

status HTTP API response
{
  "connected_peers": [],
  "handshaking_peers": [],
  "connecting_peers": []
},

Cause

At the moment, Hathor Network mainnet operates with a whitelist — i.e., only peers whose id is in the whitelist are able to connect to the network. The warning message(s) your full node received means that one or more peers rejected the connection because your full node's peer_id is not in the whitelist.

Solution

See Select network instance at Full node key configurations.

Slow initial synchronization

Situation

You have successfully connected your full node to Hathor Network, but syncing with its peers is taking too much time.

Cause

The initial synchronization is a process that happens when you deploy a new full node, or when you restart a full node after some period offline. In case of a new deploy, it will need to sync the entire ledger from genesis block. In case of restarting a full node that was offline for a long time, it will need to sync the ledger from the point it stopped.

In either case, this processes naturally takes a long time (hours), because the full node must download and validate all transactions and blocks in the ledger

As of February 2024, syncing from genesis block takes on average 10 hours for Hathor Network testnet and 24 hours for mainnet. As time passes and the ledger grows, the time required for initial syncing tends to increase.

Workaround

To expedite this process, you can bootstrap your full node with a snapshot. Snapshots allow nodes to rapidly catch up with the network. The trade off is that your full node will be relying on the snapshot to create its ledger, rather than making the entire validation process on its own.

To use this workaround, see How to bootstrap full node with a snapshot.

For more on snapshots, see Snapshot at encyclopedia.

Peer connection failures

Situation

Your full node is currently operational and is logging one or more of the following warning messages:

[warning][hathor.p2p.protocol] remote error payload=Connection rejected. peer_id=None remote=<IP_of_some_peer>:40403

This means that a peer responded by rejecting the connection.

[warning][hathor.p2p.manager] connection failure endpoint=tcp://<IP_of_some_peer>:40403 failure=User timeout caused connection failure.

This means that a peer did not respond to the connection request.

[warning][hathor.p2p.protocol] Connection closed for idle timeout. peer_id=None remote=<IP_of_some_peer>:54254

Connection failures are a normal aspect of a full node's ongoing operation. As long as your full node remains well-connected to the network, these messages should not be a cause for concern.

Diagnosis

To determine if your full node is well-connected to the network, send an HTTP API request to check its status. For example:

curl -X GET http://localhost:8080/v1a/status/ | jq .connections

In the API response, look for the connections object. Count how many objects the connected_peers property has:

status HTTP API response
{
  "connected_peers": [
    {
      "id": "<connected_peer_id_1>",
      ...
    },
    {
      "id": "<connected_peer_id_2>",
      ...
    },
    ...
    {
      "id": "<connected_peer_id_n>",
      ...
    },
  ],
  ...
},

To be considered well-connected, a full node should average around 20 connections on mainnet, or 5 to 10 on testnet.

HTTP 503: service unavailable

Situation

You sent an HTTP API request to the full node and received the following status message as response: Server Error 503: Service Unavailable.

Diagnosis

Ensure that the Server Error 503: Service Unavailable status message is originating from the full node itself, not from a reverse proxy.

Cause

If the full node itself is responding with a status code 503, this means that it has been started without the wallet-index parameter. As a result, it cannot process requests that depend on this parameter for proper functioning.

Solution

Restart your full node with the --wallet-index option/environment variable. For example:

Source code
Docker container
Docker compose

poetry run hathor-cli run_node --testnet --data ../data --status 8080 --wallet-index

docker run \
  -it -p 8080:8080 -v ${PWD}/data:/data \
  hathornetwork/hathor-core \
  run_node --testnet --data /data --status 8080 --wallet-index

<absolute_path_hathor_installation>/docker-compose.yml
...
services:
  full-node:
    image: hathornetwork/hathor-core
    ports:
      - "8080:8080"
    volumes:
      - ${PWD}/data:/data
    environment:
      - HATHOR_TESTNET=true
      - HATHOR_DATA=/data
      - HATHOR_STATUS=8080
      - HATHOR_WALLET_INDEX=true
      - HATHOR_CACHE=true
      - HATHOR_CACHE_SIZE=100000
    networks:
      hathor:
        aliases:
          - full-node
    command: run_node
  ...

Unresponsive full node

Situation

Your full node was normally responding to your API requests but then suddenly became unresponsive. This typically manifests with one or more of the following error messages:

request timed out
connection timed out
connection reset by peer
unable to connect to the server

Diagnosis

Check the host to ensure the full node is still up and running. If so, this might indicate that your full node is experiencing high CPU usage. See the section High CPU usage of this article.

High CPU usage

Situation

Your full node is presenting one or more of the following symptoms:

It suddenly becomes unresponsive to API requests.
It suddenly rejects all new connections with other peers.
It suddenly drops established connections with its peers.

Diagnosis

When these symptoms appear together, they indicate that your full node is experiencing high CPU usage, which means zero or near-zero CPU idle time. Use a utility — such as top, htop, vmstat, or mpstat —, to confirm high CPU usage on the full node's host.

Cause

The most common cause of high CPU usage is processing API requests related to addresses with a high number of transactions. Some use cases may involve many such addresses and require their full nodes to process multiple requests for them simultaneously. If this happens in your use case, it is likely the root cause of the problem.

Resolution

If this is the case for your full node, the resolution may vary depending on your use case. Otherwise, a deeper investigation will be necessary to identify the root cause of the problem. Regardless, send a message to the #development channel on the Hathor Discord server for assistance from the Hathor team and community members.

I still need help

If this article does not address your problem, or if the provided instructions were insufficient, send a message to the #development channel on Hathor Discord server for assistance from Hathor team and community members.

What's next?

Upgrade Hathor core
Full node pathway: for how to operate this application.

Goal​

Installation failure: RocksDB​

Situation​

Cause​

Solutions​

Fatal Python error: aborted​

Situation​

Diagnosis​

Cause​

Solution​

Initialization failure: RocksDB​

Situation​

Causes​

Solutions​

Initialization failure: unreliable storage​

Situation​

Cause​

Solution​

Initialization failure: event queue disabled​

Situation​

Cause​

Solutions​

Solution 1: enable ledger events producer​

Solution 2: discard ledger events from database​

Peer discovery failure​

Situation​

Cause​

Workaround​

Unable to connect to mainnet​

Situation​

Diagnosis​

Cause​

Solution​

Slow initial synchronization​

Situation​

Cause​

Workaround​

Peer connection failures​

Situation​

Diagnosis​

HTTP 503: service unavailable​

Situation​

Diagnosis​

Cause​

Solution​

Unresponsive full node​

Situation​

Diagnosis​

High CPU usage​

Situation​

Diagnosis​

Cause​

Resolution​

I still need help​

What's next?​

Goal

Installation failure: RocksDB

Situation

Cause

Solutions

Fatal Python error: aborted

Situation

Diagnosis

Cause

Solution

Initialization failure: RocksDB

Situation

Causes

Solutions

Initialization failure: unreliable storage

Situation

Cause

Solution

Initialization failure: event queue disabled

Situation

Cause

Solutions

Solution 1: enable ledger events producer

Solution 2: discard ledger events from database

Peer discovery failure

Situation

Cause

Workaround

Unable to connect to mainnet

Situation

Diagnosis

Cause

Solution

Slow initial synchronization

Situation

Cause

Workaround

Peer connection failures

Situation

Diagnosis

HTTP 503: service unavailable

Situation

Diagnosis

Cause

Solution

Unresponsive full node

Situation

Diagnosis

High CPU usage

Situation

Diagnosis

Cause

Resolution

I still need help

What's next?