Full node troubleshooting
Goal
This article will guide you in troubleshooting a Hathor full node. Each of the following sections provides the resolution to a problem you may encounter while operating and using it.
Initialization failure: unreliable storage
Situation
You ran the command to start the full node and received the following error message:
[error][hathor.manager] Error initializing node. The last time you executed your full node it wasn't stopped correctly. The storage is not reliable anymore and, because of that, so you must run a full verification or remove your storage and do a full sync.
Cause
This happens when the full node's database (specifically, the ledger) is in an unreliable state. As explained by the message, this occurs when the full node does not shut down properly — e.g., the process was abruptly killed, or the machine was turned off.
Solution
Restart your full node from an empty database. This solution entails the initial synchronization of the ledger from the genesis block, which naturally takes a long time. However, you can expedite it by using a snapshot of the ledger.
If you want to carry out this solution using a snapshot, see Slow initial synchronization.
If you want to carry out this solution without using a snapshot, follow this procedure (a shell sketch is provided after the list):
- Start a shell session.
- Change the working directory to <absolute_path_hathor_full_node>/data, replacing the <absolute_path_hathor_full_node> placeholder with the directory where you installed your full node.
- Remove all content of the data subdirectory.
- Restart your full node as usual.
If you are using the ledger events produced by your full node, your event-consuming client application will need to discard all events and reprocess them from scratch.
The ledger events generated by the full node are specific to a database instance. Therefore, whenever you restart from an empty database, the events logged in event-consuming client applications of the full node will no longer be valid.
To know more about this, see the warning box in section Initialization failure: event queue disabled.
Initialization failure: event queue disabled
Situation
You ran the command to start the full node and received the following error message:
hathor.exception.InitializationError: Cannot start manager without event queue feature, as it was enabled in the previous startup. Either enable it, or use the reset-event-queue CLI command to remove all event-related data
Cause
The last time this full node ran, the ledger events producer was enabled. Now, you are trying to start it with this feature disabled. However, the full node does not initialize with the feature disabled as long as its database still contains previously generated ledger events.
Solutions
There are two possible solutions:
- Enable ledger events producer.
- Discard ledger events from database.
Solution (1) means restarting the full node with the ledger events producer enabled. In a production environment, this is almost always the preferred alternative. Even if your full node does not need to provide ledger events to clients now, it may be required in the future, and the additional resource consumption to keep the feature running is negligible.
Solution (2) means discarding the ledger events from the full node's database. It is mostly used for adjustments in development and test environments or in rare cases where the database has entered an unreliable state, and the ledger events have become unusable.
Solution 1: enable ledger events producer
See Enable ledger events producer at Full node key configurations.
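For reference only, a minimal sketch of what this typically looks like when running from source, assuming the option is named --x-enable-event-queue (the exact option/environment variable is described in the linked article and may differ in your version of Hathor core):
cd hathor-core && poetry run hathor-cli run_node --testnet --data ../data --status 8080 --wallet-index --x-enable-event-queue    # option name is an assumption; see the linked article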
Solution 2: discard ledger events from database
- Start a shell session.
- Change the working directory to where you installed your full node — namely, hathor-core and database, docker-compose.yml and database, or just the database.
- Run the command to discard the ledger events from the database. For example:
- Source code
- Docker
cd hathor-core && poetry run hathor-cli reset-event-queue --data ../data
docker run \
-v ${PWD}/data:/data \
hathornetwork/hathor-core \
reset-event-queue --data /data
Note that the subcommand hathor-cli reset-event-queue requires the option --data <your_full_node_database>, where the placeholder <your_full_node_database> refers to the directory where you stored your full node's database — typically, data.
If the procedure is successful, you will receive the following informational message:
[info][hathor.cli.reset_event_queue] removing all events and related data...
[info][hathor.cli.reset_event_queue] reset complete
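After the reset completes, you can start the full node again with the event queue feature disabled, using the same run command shown elsewhere in this article. For example, from source:
cd hathor-core && poetry run hathor-cli run_node --testnet --data ../data --status 8080 --wallet-index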
If the full node discards its ledger events database and starts a new one, client applications must do the same. Otherwise, they will no longer receive new event notifications.
For example, suppose your full node operated with the ledger events producer enabled. During this period, several client applications connected to your full node’s async API and subscribed to be notified of these events. However, after a full node failure, it was necessary to sync from scratch with the network. As a result, the full node generated a new ledger event database from scratch. To receive event notifications again, client applications need to discard their event database and restart from scratch to sync with the full node’s new event database.
Peer discovery failure
Situation
Your full node is initializing but stalls during the peer discovery phase. It logs one or more of the following error messages:
[error][hathor.p2p.peer_discovery.dns] errback extra={'dns_seed_lookup_text', 'alpha.nano-testnet.hathor.network'} result=<twisted.python.failure.Failure builtins.OSError: [Errno 65] No route to host>
Cause
The full node is failing to perform the DNS query. This is a known issue in Hathor core that occurs only in very specific environments, typically related to OS and home LAN configurations of personal machines.
The DNS provides a set of peer IPs that serves as the seed for peer discovery. Without this seed, the full node is unable to connect to the network, as it does not know any of its peers.
Workaround
Until this issue is resolved, you can manually work around it by querying the network's DNS and providing the retrieved IPs to Hathor core as seed for peer discovery. To carry out this workaround, follow this procedure:
- Select the DNS corresponding to the Hathor Network instance you want to connect to:
  - Mainnet: mainnet.hathor.network
  - Testnet: golf.testnet.hathor.network
  - Nano-testnet: alpha.nano-testnet.hathor.network
- Start a shell session.
- Query the DNS of the selected network. For example:
dig TXT golf.testnet.hathor.network
In the ANSWER SECTION you will obtain the IPs to be used. For example:
...
;; ANSWER SECTION:
golf.testnet.hathor.network. 60 IN TXT "tcp://18.156.174.211:40403/?id=6d6d72156f20d294c6677a8963ebe70df66b5beaf12773c16de250f8275fb6c5"
golf.testnet.hathor.network. 60 IN TXT "tcp://18.199.240.217:40403/?id=e4466f8e05e93dc7b077af3807830bee296936772033b73ee32da59e5400d8fd"
golf.testnet.hathor.network. 60 IN TXT "tcp://34.230.30.110:40403/?id=ffcc778abd0cf1be33062bbdaa48f9909e2e5a2947390efc72070c38c0505e69"
...
- Restart your full node using the --bootstrap option/environment variable one or more times, once for each IP you want to provide as a seed. For example:
- Source code
- Docker container
- Docker compose
poetry run hathor-cli run_node --status 8080 --testnet --data ../data --wallet-index --bootstrap tcp://18.156.174.211:40403 --bootstrap tcp://18.199.240.217:40403 --bootstrap tcp://34.230.30.110:40403
docker run \
-it -p 8080:8080 -v <absolute_path_hathor_full_node>/data:/data \
hathornetwork/hathor-core \
run_node --status 8080 --testnet --data /data --wallet-index --bootstrap tcp://18.156.174.211:40403 --bootstrap tcp://18.199.240.217:40403 --bootstrap tcp://34.230.30.110:40403
services:
  hathor-core:
    image: hathornetwork/hathor-core
    command: run_node
    ports:
      - "8080:8080"
      - "8081:8081"
    volumes:
      - <absolute_path_hathor_full_node>/data:/data
    environment:
      - HATHOR_STATUS=8080
      - HATHOR_STRATUM=8081
      - HATHOR_TESTNET=true
      - HATHOR_DATA=/data
      - HATHOR_WALLET_INDEX=true
      - HATHOR_CACHE=true
      - HATHOR_CACHE_SIZE=100000
      - HATHOR_BOOTSTRAP=tcp://18.156.174.211:40403
      - HATHOR_BOOTSTRAP=tcp://18.199.240.217:40403
      - HATHOR_BOOTSTRAP=tcp://34.230.30.110:40403
...
Note that providing a single IP is sufficient as a seed for peer discovery, while multiple IPs ensure that the full node can successfully find an available peer. Additionally, the --bootstrap option/environment variable must receive exactly one argument per use. Therefore, you must append it as many times as the number of IPs you want to provide.
Instead of manually querying the DNS, you can embed it directly into the command used to start your full node. For example:
HATHOR_DNS=golf.testnet.hathor.network; poetry run hathor-cli run_node --status 8080 --testnet --data ../data --wallet-index $(dig TXT $HATHOR_DNS +short | sed 's/"/--bootstrap /' | sed 's/"//')
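For reference, the pipeline inside the command substitution prints one --bootstrap flag per TXT record returned by the DNS query. With the example records shown earlier, it would print roughly the following (IPs and peer ids are illustrative and change over time):
--bootstrap tcp://18.156.174.211:40403/?id=6d6d72156f20d294c6677a8963ebe70df66b5beaf12773c16de250f8275fb6c5
--bootstrap tcp://18.199.240.217:40403/?id=e4466f8e05e93dc7b077af3807830bee296936772033b73ee32da59e5400d8fd
--bootstrap tcp://34.230.30.110:40403/?id=ffcc778abd0cf1be33062bbdaa48f9909e2e5a2947390efc72070c38c0505e69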
Unable to connect to mainnet
Situation
You have just started a full node, and it is attempting to connect to its peers in mainnet. It then begins logging one or more of the following warning messages:
[warning][hathor.p2p.protocol] remote error payload=Blocked (by <peer_id_of_your_full_node>). Get in touch with Hathor team. peer_id=None remote=<IP_of_some_peer>:40403
Diagnosis
Send an HTTP API request to check the status of the full node. For example:
curl -X GET http://localhost:8080/v1a/status/ | jq .connections
In the API response, look for the connections object. If its properties connected_peers, handshaking_peers, and connecting_peers all have empty arrays, it means your full node is unable to connect to any other peer (which means it is not connected to the network):
{
"connected_peers": [],
"handshaking_peers": [],
"connecting_peers": []
},
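A quick way to perform this check from the shell, assuming jq is available (a sketch, not part of the full node CLI):
curl -s http://localhost:8080/v1a/status/ | jq '(.connections.connected_peers + .connections.handshaking_peers + .connections.connecting_peers) | length'
# prints 0 when all three arrays are empty, i.e., the full node is not connected to the network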
Cause
At the moment, Hathor Network mainnet operates with a whitelist — i.e., only peers whose id is in the whitelist are able to connect to the network. The warning message(s) your full node received means that one or more peers rejected the connection because your full node's peer_id is not in the whitelist.
Solution
See Select network instance at Full node key configurations.
Slow initial synchronization
Situation
You have successfully connected your full node to Hathor Network, but syncing with its peers is taking too much time.
Cause
The initial synchronization is a process that happens when you deploy a new full node, or when you restart a full node after some period offline. In the case of a new deployment, it will need to sync the entire ledger from the genesis block. In the case of restarting a full node that was offline for a long time, it will need to sync the ledger from the point where it stopped.
In either case, this process naturally takes a long time (hours), because the full node must download and validate all transactions and blocks in the ledger.
As of February 2024, syncing from the genesis block takes on average 10 hours for Hathor Network testnet and 24 hours for mainnet. As time passes and the ledger grows, the time required for initial syncing tends to increase.
Workaround
To expedite this process, you can bootstrap your full node with a snapshot. Snapshots allow nodes to rapidly catch up with the network. The trade-off is that your full node will rely on the snapshot to create its ledger, rather than performing the entire validation process on its own.
To use this workaround, see How to bootstrap full node with a snapshot.
To know more about snapshots, see Snapshot at encyclopedia.
Peer connection failures
Situation
Your full node is currently operational and is logging one or more of the following warning messages:
[warning][hathor.p2p.protocol] remote error payload=Connection rejected. peer_id=None remote=<IP_of_some_peer>:40403
This means that a peer responded by rejecting the connection.
[warning][hathor.p2p.manager] connection failure endpoint=tcp://<IP_of_some_peer>:40403 failure=User timeout caused connection failure.
This means that a peer did not respond to the connection request.
[warning][hathor.p2p.protocol] Connection closed for idle timeout. peer_id=None remote=<IP_of_some_peer>:54254
Connection failures are a normal aspect of a full node's ongoing operation. As long as your full node remains well-connected to the network, these messages should not be a cause for concern.
Diagnosis
To determine if your full node is well-connected to the network, send an HTTP API request to check its status. For example:
curl -X GET http://localhost:8080/v1a/status/ | jq .connections
In the API response, look for the connections object. Count how many objects the connected_peers property has:
{
"connected_peers": [
{
"id": "<connected_peer_id_1>",
...
},
{
"id": "<connected_peer_id_2>",
...
},
...
{
"id": "<connected_peer_id_n>",
...
},
],
...
},
To be considered well-connected, a full node should average around 20 connections on mainnet, or 5 to 10 on testnet.
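To get this count directly, assuming jq is available, you can run, for example:
curl -s http://localhost:8080/v1a/status/ | jq '.connections.connected_peers | length'
# prints the number of connected peers; around 20 on mainnet, or 5 to 10 on testnet, indicates a well-connected full node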
HTTP 503: service unavailable
Situation
You sent an HTTP API request to the full node and received the following status message as a response: Server Error 503: Service Unavailable.
Diagnosis
Ensure that the Server Error 503: Service Unavailable status message is originating from the full node itself, not from a reverse proxy.
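One way to do this is to query the full node directly on its own host and port, bypassing any reverse proxy in front of it, and inspect the response headers. For example, a sketch assuming the status API is exposed on localhost:8080, as in the other examples of this article:
curl -i http://localhost:8080/v1a/status/
# the -i flag prints the response headers; a Server header naming a proxy (e.g., nginx) suggests the 503 came from the proxy, not from the full node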
Cause
If the full node itself is responding with a status code 503, this means that it has been started without the wallet-index parameter. As a result, it cannot process requests that depend on this parameter for proper functioning.
Solution
Restart your full node with the --wallet-index option/environment variable. For example:
- Source code
- Docker container
- Docker compose
poetry run hathor-cli run_node --testnet --data ../data --status 8080 --wallet-index
docker run \
-it -p 8080:8080 -v ${PWD}/data:/data \
hathornetwork/hathor-core \
run_node --testnet --data /data --status 8080 --wallet-index
...
services:
  full-node:
    image: hathornetwork/hathor-core
    ports:
      - "8080:8080"
    volumes:
      - ${PWD}/data:/data
    environment:
      - HATHOR_TESTNET=true
      - HATHOR_DATA=/data
      - HATHOR_STATUS=8080
      - HATHOR_WALLET_INDEX=true
      - HATHOR_CACHE=true
      - HATHOR_CACHE_SIZE=100000
    networks:
      hathor:
        aliases:
          - full-node
    command: run_node
...
Unresponsive full node
Situation
Your full node was responding normally to your API requests but then suddenly became unresponsive. This typically manifests with one or more of the following error messages:
request timed out
connection timed out
connection reset by peer
unable to connect to the server
Diagnosis
Check the host to ensure the full node is still up and running. If so, this might indicate that your full node is experiencing high CPU usage. See the section High CPU usage of this article.
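To check whether the full node process is still up, you can use, for example, the following commands (illustrative; they depend on how you run the node):
pgrep -af hathor                                         # if the full node runs from source
docker ps --filter ancestor=hathornetwork/hathor-core    # if the full node runs as a Docker container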
High CPU usage
Situation
Your full node is presenting one or more of the following symptoms:
- It suddenly becomes unresponsive to API requests.
- It suddenly rejects all new connections with other peers.
- It suddenly drops established connections with its peers.
Diagnosis
When these symptoms appear together, they indicate that your full node is experiencing high CPU usage, which means zero or near-zero CPU idle time. Use a utility — such as top, htop, vmstat, or mpstat — to confirm high CPU usage on the full node's host.
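For example, with vmstat or mpstat (standard Linux utilities; idle values at or near zero confirm CPU saturation):
vmstat 1 5    # samples CPU statistics every second, 5 times; check the 'id' (idle) column
mpstat 1 5    # per-processor statistics (requires sysstat); check the '%idle' column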
Cause
The most common cause of high CPU usage is processing API requests related to addresses with a high number of transactions. Some use cases may involve many such addresses and require their full nodes to process multiple requests for them simultaneously. If this happens in your use case, it is likely the root cause of the problem.
Resolution
If this is the case for your full node, the resolution may vary depending on your use case. Otherwise, a deeper investigation will be necessary to identify the root cause of the problem. Regardless, send a message to the #development channel on the Hathor Discord server for assistance from the Hathor team and community members.
I still need help
If this article does not address your problem, or if the provided instructions were insufficient, send a message to the #development channel on the Hathor Discord server for assistance from the Hathor team and community members.
What's next?
- Full node pathway: to know how to operate this application.