Hathor full node troubleshooting

Goal

This article helps you troubleshoot a Hathor full node. Each of the following sections resolves a problem you may encounter while operating or using a Hathor full node.

Error initializing node

Situation

You ran the command to start the full node and received the following error message:

[error][hathor.manager] Error initializing node. The last time you executed your full node it wasn't stopped correctly. The storage is not reliable anymore and, because of that, so you must run a full verification or remove your storage and do a full sync.

Cause

This happens when the full node's database (namely, the ledger) is in an unreliable state. As the message explains, this occurs when the full node does not shut down properly, for example because the process was abruptly killed or the machine was turned off.

Solutions

There are two possible solutions:

  1. Execute a full verification of the database.
  2. Restart from an empty database.

Solution (1) takes time. Solution (2) requires the initial synchronization of the ledger from the genesis block, which takes even longer. However, you can expedite solution (2) by using a snapshot of the ledger, which makes it by far the fastest approach.

Therefore, we recommend restarting from an empty database and using a snapshot, that is, combining solution (2) with the procedure in the section Slow initial synchronization of this article.

Solution 1: execute a full verification of the database

If you installed your full node from source code, run the command to start it using the --x-full-verification option. For example:

poetry run hathor-cli run_node --testnet --wallet-index --status 8080 --data ../data --x-full-verification

The next time you start your full node, it will execute a full verification of the database. This is necessary only once. Therefore, be sure to remove the full verification option/environment variable from your deployment configuration afterward. Otherwise, the full node will execute the full verification process every time it restarts, even if the database is already in a reliable state.

If the full verification of the database fails, use solution 2.

Solution 2: restart from an empty database

If you want to carry out this solution using a snapshot, see Slow initial synchronization.

If you want to carry out this solution without using a snapshot, follow this procedure:

  1. Start a shell session.
  2. Change the working directory to <absolute_path_hathor_full_node>/data, replacing the <absolute_path_hathor_full_node> placeholder with the directory where you installed your full node.
  3. Remove all content of the data subdirectory.
  4. Restart your full node as usual.

Warning: if you are using the event subsystem of your full node, your event-consuming client application will need to discard all events and reprocess them from scratch.

The events generated by the event subsystem of your full node are specific to a database instance. Therefore, whenever you restart from an empty database, the events logged in event-consuming client applications of the full node will no longer be valid.
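Steps 2 and 3 of the procedure above can be sketched in Python as follows. This is a minimal sketch, not part of Hathor's tooling: the helper name clear_data_dir is hypothetical, and it assumes the full node has already been stopped and that you pass it the path of your own data subdirectory.

```python
import shutil
from pathlib import Path


def clear_data_dir(data_dir: Path) -> int:
    """Remove everything inside data_dir but keep the directory itself.

    Returns the number of top-level entries removed.
    Hypothetical helper: stop the full node before calling it.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise NotADirectoryError(f"not a directory: {data_dir}")
    removed = 0
    for entry in data_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # subdirectories and their content
        else:
            entry.unlink()  # plain files and symlinks
        removed += 1
    return removed
```

Call it with the path from step 2, for example clear_data_dir(Path("<absolute_path_hathor_full_node>/data")), then restart the full node as in step 4. Keeping the directory itself means the usual --data argument continues to point at a valid location.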

Unable to connect to mainnet

Situation

You have just started a full node, and it is attempting to connect to its peers on mainnet. It then logs one or more warning messages like the following:

[warning][hathor.p2p.protocol] remote error payload=Blocked (by <peer_id_of_your_full_node>). Get in touch with Hathor team. peer_id=None remote=<IP_of_some_peer>:40403

Diagnosis

Send an HTTP API request to check the status of the full node. For example:

curl -X GET http://localhost:8080/v1a/status/ | jq .connections

In the API response, look at the connections object. If its properties connected_peers, handshaking_peers, and connecting_peers all hold empty arrays, your full node is unable to connect to any peer and is therefore not connected to the network.

status HTTP API response
{
  "connected_peers": [],
  "handshaking_peers": [],
  "connecting_peers": []
}
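If you prefer to automate this check, the condition described above can be sketched in Python. The function name is_isolated is hypothetical, and treating a missing property as an empty list is an assumption of this sketch, not documented behavior of the API.

```python
import json

PEER_LISTS = ("connected_peers", "handshaking_peers", "connecting_peers")


def is_isolated(connections: dict) -> bool:
    """True when all three peer lists of the connections object are empty."""
    # Assumption: a missing property is treated like an empty list.
    return all(len(connections.get(key, [])) == 0 for key in PEER_LISTS)


# The connections object from the response shown above.
sample = json.loads(
    '{"connected_peers": [], "handshaking_peers": [], "connecting_peers": []}'
)
print(is_isolated(sample))  # True
```

In practice you would feed it the output of the curl command above, parsed with json.loads.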

Cause

At the moment, Hathor Network mainnet operates with a whitelist: only peers whose id is in the whitelist are able to connect to the network. The warning messages your full node logged mean that one or more peers rejected the connection because your full node's peer_id is not in the whitelist.
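The warning itself names the peer id that was blocked, which is your own full node's id. As a rough sketch, the following Python snippet pulls that id out of a log line. The pattern is based only on the message text quoted in this section, and the function name blocked_peer_id is hypothetical; if the log format differs in your version, the pattern needs adjusting.

```python
import re
from typing import Optional

# Pattern derived from the warning text quoted in the Situation above.
BLOCKED_RE = re.compile(r"payload=Blocked \(by (?P<peer_id>[^)]+)\)")


def blocked_peer_id(log_line: str) -> Optional[str]:
    """Return the peer id named in a 'Blocked (by ...)' warning, else None."""
    match = BLOCKED_RE.search(log_line)
    return match.group("peer_id") if match else None
```

Lines that are not whitelist rejections yield None, so the function can be applied to a whole log file without pre-filtering.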

Solution

See How to connect Hathor full node to mainnet.

Slow initial synchronization

Situation

You have successfully connected your full node to Hathor Network, but syncing with its peers is taking too much time.

Cause

The initial synchronization is a process that happens when you deploy a new full node, or when you restart a full node after some period offline. A newly deployed full node needs to sync the entire ledger from the genesis block. A full node restarted after a long time offline needs to sync the ledger from the point where it stopped.

In either case, this process naturally takes a long time (hours), because the full node must download and validate all transactions and blocks in the ledger.

As of February 2024, syncing from the genesis block takes on average 10 hours for Hathor Network testnet and 24 hours for mainnet. As time passes and the ledger grows, the time required for initial syncing tends to increase.

Workaround

To expedite this process, you can bootstrap your full node from a snapshot. Snapshots allow nodes to rapidly catch up with the network. The trade-off is that your full node relies on the snapshot to build its ledger, rather than validating the entire ledger on its own.

To use this workaround, see How to bootstrap from a snapshot.

To learn more about snapshots, see Snapshot in the encyclopedia.

Connection failure

Situation

Your full node is currently operational and logs one or more of the following warning messages:

[warning][hathor.p2p.protocol] remote error payload=Connection rejected. peer_id=None remote=<IP_of_some_peer>:40403

This means that a peer responded by rejecting the connection.

[warning][hathor.p2p.manager] connection failure endpoint=tcp://<IP_of_some_peer>:40403 failure=User timeout caused connection failure.

This means that a peer did not respond to the connection request.

[warning][hathor.p2p.protocol] Connection closed for idle timeout. peer_id=None remote=<IP_of_some_peer>:54254

This means that a connection was closed after staying idle for too long.

Connection failures are a normal part of a full node's ongoing operation. As long as your full node remains well-connected to the network, these messages are not a cause for concern.
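To get a feel for how often each kind of failure occurs, you can tally the warnings in a log file. The sketch below matches on substrings taken from the three messages quoted above. The labels (rejected, no_response, idle_timeout) and the function names are our own, not Hathor terminology.

```python
from collections import Counter
from typing import Iterable, Optional

# Substrings taken from the three warnings quoted above; labels are ours.
WARNING_KINDS = {
    "payload=Connection rejected": "rejected",
    "failure=User timeout caused connection failure": "no_response",
    "Connection closed for idle timeout": "idle_timeout",
}


def classify(log_line: str) -> Optional[str]:
    """Return the label of the first matching warning kind, else None."""
    for needle, kind in WARNING_KINDS.items():
        if needle in log_line:
            return kind
    return None


def tally(log_lines: Iterable[str]) -> Counter:
    """Count connection-failure warnings per kind, ignoring other lines."""
    return Counter(kind for kind in map(classify, log_lines) if kind is not None)
```

A steady trickle of all three kinds is expected. The tally is only a convenience for spotting sudden changes, not a health check in itself.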

Diagnosis

To determine if your full node is well-connected to the network, send an HTTP API request to check its status. For example:

curl -X GET http://localhost:8080/v1a/status/ | jq .connections

In the API response, look for the connections object. Count how many objects the connected_peers property has:

status HTTP API response
{
  "connected_peers": [
    {
      "id": "<connected_peer_id_1>",
      ...
    },
    {
      "id": "<connected_peer_id_2>",
      ...
    },
    ...
    {
      "id": "<connected_peer_id_n>",
      ...
    }
  ],
  ...
}

To be considered well-connected, a full node should average around 20 connections on mainnet, or 5 to 10 on testnet.
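The count can also be checked programmatically. The following sketch compares the number of connected peers against a per-network floor. Using 20 for mainnet and 5 for testnet as hard thresholds is an assumption of this sketch: the figures above are averages, so a single reading below them is not necessarily a problem. The function names are hypothetical.

```python
# Floors derived from the averages given above ("around 20" on mainnet,
# "5 to 10" on testnet). Treating them as hard minimums is an assumption.
MIN_PEERS = {"mainnet": 20, "testnet": 5}


def connected_peer_count(connections: dict) -> int:
    """Number of entries in the connected_peers array."""
    return len(connections.get("connected_peers", []))


def is_well_connected(connections: dict, network: str) -> bool:
    """True when the peer count reaches the floor for the given network."""
    return connected_peer_count(connections) >= MIN_PEERS[network]
```

Because the guidance is about averages, it is more meaningful to sample the status endpoint several times and look at the trend than to act on one low reading.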

HTTP 503: service unavailable

Situation

You sent an HTTP API request to the full node and received the following status message as response: Server Error 503: Service Unavailable.

Diagnosis

Ensure that the Server Error 503: Service Unavailable status message is originating from the full node itself, not from a reverse proxy.
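One way to check this is to repeat the failing request directly against the full node's own status port, bypassing the reverse proxy. The sketch below returns the HTTP status code of such a direct request, using only the Python standard library. The function name direct_status_code is hypothetical, and the URL you pass is up to you: it should combine the node's own host and port with the path of the request that failed.

```python
import urllib.error
import urllib.request


def direct_status_code(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server at `url` answers with."""
    # An empty ProxyHandler ignores any proxy settings from the environment,
    # so the request goes straight to the address given in `url`.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses are raised as exceptions
```

If the direct request also returns 503, the status originates from the full node itself. If it succeeds while the request through the proxy fails, the proxy is the more likely source.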

Cause

If the full node itself is responding with status code 503, it was started without the wallet-index parameter. As a result, it cannot process requests that depend on the wallet index.

Solution

Restart your full node with the wallet-index option/environment variable.

If you installed your full node from source code, restart it using the --wallet-index option. For example:

poetry run hathor-cli run_node --testnet --wallet-index --status 8080 --data ../data

Unresponsive full node

Situation

Your full node was responding normally to your API requests but then suddenly became unresponsive. This typically manifests with one or more of the following error messages:

  • request timed out
  • connection timed out
  • connection reset by peer
  • unable to connect to the server

Diagnosis

Check the host to ensure the full node is still up and running. If so, this might indicate that your full node is experiencing high CPU usage. See the section High CPU usage of this article.
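As a quick first check, you can test whether anything is still accepting TCP connections on the full node's status port. This is a minimal sketch with a hypothetical function name. Note that a successful connection only shows that the process is listening: a node under heavy CPU load may accept the connection and still fail to answer requests in time.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """True when a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For a node started with --status 8080 as in the examples of this article, is_listening("localhost", 8080) would be the corresponding call.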

High CPU usage

Situation

Your full node is presenting one or more of the following symptoms:

  • It suddenly becomes unresponsive to API requests.
  • It suddenly rejects all new connections with other peers.
  • It suddenly drops established connections with its peers.

Diagnosis

When these symptoms appear together, they indicate that your full node is experiencing high CPU usage, which means zero or near-zero CPU idle time. Use a utility such as top, htop, vmstat, or mpstat to confirm high CPU usage on the full node's host.
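If none of these utilities is at hand on a Linux host, the idle share can be estimated from two readings of the aggregate cpu line in /proc/stat. The sketch below is Linux-specific and uses hypothetical function names. It counts only the idle field as idle time, so time spent waiting for I/O is treated as non-idle.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def idle_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent idle between two 'cpu' line samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already included in user, so it is left out.
    deltas = [b - a for a, b in zip(first[:8], second[:8])]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * deltas[3] / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as stat_file:
        return stat_file.readline()
```

Take one reading with read_cpu_line(), wait a second or more, take another, and pass both to idle_percent. A result at or near zero matches the diagnosis described above.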

Causes

There are two well-established causes for high CPU usage in a full node:

  1. Using version 1 of the synchronization algorithm.
  2. Using addresses with a high number of transactions.

Synchronization is the process by which all nodes of a blockchain network maintain the same copy of the ledger. The first version of the synchronization algorithm implemented in the Hathor protocol may consume a lot of CPU time when the full node is connected to a high number of peers in the network. To solve this problem, the Hathor protocol was updated with a new version of the synchronization algorithm (version 2), which has been the default since Hathor core v0.59.0.

Processing API requests related to addresses with a high number of transactions consumes a significant amount of a full node's CPU time. Some use cases involve many such addresses and may require their full node to process multiple requests related to them simultaneously. This can lead to high CPU usage in that full node.

Resolutions

Resolution for cause 1 (sync algorithm v1)

If you are running a full node with Hathor core v0.58.0 or earlier, update it to v0.59.0 or later. See How to upgrade Hathor full node.
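The version rule above can be expressed as a small check. This sketch assumes plain MAJOR.MINOR.PATCH version strings, optionally prefixed with v. Pre-release suffixes are not handled, and the function names are hypothetical.

```python
from typing import Tuple

# Sync algorithm v2 has been the default since Hathor core v0.59.0.
SYNC_V2_DEFAULT_SINCE = (0, 59, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn 'v0.58.0' or '0.58.0' into a tuple of integers."""
    return tuple(int(part) for part in version.strip().lstrip("v").split("."))


def needs_upgrade(version: str) -> bool:
    """True for versions earlier than v0.59.0."""
    return parse_version(version) < SYNC_V2_DEFAULT_SINCE
```

Comparing integer tuples rather than strings matters here: as text, "0.9.0" sorts after "0.59.0", while numerically it is the earlier version.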

Resolution for cause 2 (addresses with high number of transactions)

If you have already upgraded Hathor core to v0.59.0 or later and are still experiencing high CPU usage, the problem is likely related to responding to API requests involving addresses with a high number of transactions, such as calculating the balance or history of such addresses. If this is the case for your full node, the resolution may vary depending on your use case. Send a message to the #development channel on Hathor Discord server for assistance from Hathor team and community members.

I still need help

If this article does not address your problem, or if the provided instructions were insufficient, send a message to the #development channel on Hathor Discord server for assistance from Hathor team and community members.
