T'was the day before genesis, when all was prepared, geth was in sync, my beacon node paired. Firewalls configured, VLANs galore, hours of preparation meant nothing ignored.
Then all at once everything went awry, the SSD in my system decided to die. My configs were gone, chain data was history, nothing to do but trust in next day delivery.
I found myself designing backups and redundancies. Complicated systems consumed my fantasies. Thinking further I came to realise: worrying about these kinds of failures was quite unwise.
Incentives
The beacon chain has several mechanisms in place to incentivise validator behaviour, all of which depend on the current state of the network. It is therefore important to consider these failure cases in the broader context of how the rest of the validators might fail when deciding what are, and what are not, good ways to secure your node(s).
As an active validator, your balance either goes up or down; it never stays the same*. A reasonable way to maximise your profits is therefore to minimise your downside. There are three ways your balance can be reduced on the beacon chain:
- Penalties are issued when your validator misses one of its duties (for example, because it is offline)
- Inactivity leaks are issued to validators that fail to perform their duties while the network is failing to finalise (i.e. when your validator being offline is highly correlated with other validators being offline)
- Slashings are issued to validators that produce contradictory blocks or attestations and could therefore be used in an attack
* On average, a validator’s balance may stay the same, but for any given duty they are either rewarded or penalised.
Correlation
The effect of a single validator being offline or performing slashable behavior is small in terms of the overall health of the beacon chain. It is therefore not heavily sanctioned. On the other hand, if many validators are offline, the balance of offline validators can decrease much faster.
Similarly, if many validators perform slashable actions at the same time, this is indistinguishable from an attack from the beacon chain’s perspective. It is therefore treated as such, and up to 100% of the offending validators’ stake is burned.
Because of these “anti-correlation” incentives, validators should be more concerned about failures that could affect many validators at the same time than about isolated, individual problems.
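To make the anti-correlation point concrete, here is a toy model (my own simplification for illustration, not the spec’s actual process_slashings function) of how the extra slashing penalty grows with the amount of stake slashed around the same time: an isolated slashing costs very little, while a mass slashing event approaches the validator’s entire balance.

```python
# Toy model of proportional slashing: the more stake slashed in the same
# window, the larger the extra penalty for each slashed validator.
# This is a simplification for illustration, not the spec implementation.

def proportional_slashing_penalty(
    my_balance_eth: float,
    total_slashed_eth: float,
    total_staked_eth: float,
    multiplier: int = 3,  # illustrative; the spec's multiplier has varied across upgrades
) -> float:
    """Extra penalty for a slashed validator, given how much other stake
    was slashed around the same time."""
    adjusted = min(total_slashed_eth * multiplier, total_staked_eth)
    return my_balance_eth * adjusted / total_staked_eth

total_staked = 1_000_000  # hypothetical total ETH at stake
for slashed in (32, 10_000, 350_000):  # lone validator vs. correlated mass slashing
    penalty = proportional_slashing_penalty(32, slashed, total_staked)
    print(f"{slashed:>8} ETH slashed together -> extra penalty {penalty:8.4f} ETH")
```

The exact constants differ on mainnet, but the shape of the curve is the point: a lone slashing costs a fraction of an ETH, while a slashing correlated with a third of the network costs everything.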
Failure cases and their probability
So let’s think through a few failure cases and weigh them by how many other validators would be affected at the same time, and by how badly your validator would be punished.
I disagree with @econoar here that these are worst-case problems; they are more middle-of-the-road problems. A home UPS and dual WAN address failures that are not correlated with other stakers, so they should be low on your list of concerns.
🌍 Internet/power outage
If you’re validating from home, there’s a good chance you’ll encounter one of these failures at some point. Residential internet and electricity connections don’t come with guaranteed uptime. However, when your internet or power does go down, the outage is usually limited to your local area, and even then it typically lasts only a few hours.
Unless your internet or electricity is very spotty, it is probably not worth paying for backup connections. You will incur a few hours of penalties, but because the rest of the network is operating normally, your penalties will be roughly equal to what your rewards would have been over the same period. In other words, a k-hour outage brings your validator balance back to roughly where it was k hours before the outage, and after roughly another k hours online, your balance will be back to its pre-failure amount.
(Validator #12661 regaining ETH as quickly as it was lost – Beaconcha.in)
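To see why, here is a toy sketch of the symmetry argument (the reward rate and outage length are invented): while you are offline and the network is finalising, you lose at roughly the same rate as you would otherwise have earned, so the balance curve traces a V shape.

```python
# Rough sketch of the "k-hour outage costs roughly k hours of rewards" argument.
# Assumes the offline penalty rate roughly matches the online reward rate,
# which holds approximately while the network is finalising normally.

reward_per_hour = 0.0003   # hypothetical ETH/hour earned while online
outage_hours = 6           # k: how long the node was down

balance = 32.0
history = []
for hour in range(2 * outage_hours):
    if hour < outage_hours:
        balance -= reward_per_hour   # offline: penalised at roughly the reward rate
    else:
        balance += reward_per_hour   # back online: earning again
    history.append(round(balance, 4))

print(history)
# After 2k hours the balance is back at 32.0: the outage cost k hours of rewards.
```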
🛠 Hardware failure
Just like Internet outages, hardware failures occur randomly, and when they do, your node may be unavailable for a few days. It is useful to consider the expected benefits over the life of the validator versus the cost of redundant hardware. Is the expected value of the failure (the offline penalties times the probability of it occurring) greater than the cost of the redundant hardware?
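As a back-of-the-envelope sketch (every number below is invented; substitute your own estimates), the comparison looks something like this:

```python
# Back-of-the-envelope check: is redundant hardware worth it?
# All numbers are hypothetical; plug in your own estimates.

p_failure_per_year = 0.05          # chance your SSD/PSU/etc. dies in a given year
days_offline_if_it_does = 3        # time to get a replacement delivered and synced
penalty_per_day_eth = 0.005        # rough offline penalty while the network finalises
hardware_cost_eth = 0.5            # cost of a fully redundant spare machine

expected_loss = p_failure_per_year * days_offline_if_it_does * penalty_per_day_eth
print(f"Expected yearly loss from failures: {expected_loss:.4f} ETH")
print(f"Cost of redundant hardware:        {hardware_cost_eth:.4f} ETH")
print("Worth it!" if expected_loss > hardware_cost_eth else "Probably not worth it.")
```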
Personally, I think the chances of failure are low enough, and the cost of fully redundant hardware high enough, that it isn’t worth it. But then again, I’m not a whale 🐳; as with any failure scenario, you need to evaluate how it applies to your particular situation.
☁️ Cloud services outage
Perhaps, to completely avoid the risk of hardware or internet failure, you decide to use a cloud provider. With a cloud provider, you have introduced the risk of correlated outages. The question that matters is, how many other validators use the same cloud provider as you?
A week before genesis, Amazon AWS experienced a prolonged outage which affected a large part of the Web. If something similar were to happen now, enough validators would go offline at the same time that inactivity penalties would take effect.
Worse yet, if a cloud provider duplicates the VM running your node and accidentally leaves the old and new nodes running at the same time, you could be slashed (and the penalties would be particularly severe if the accidental duplication affected many other nodes too).
If you insist on using a cloud provider, consider switching to a smaller provider. This could end up saving you a lot of ETH.
🥩 Staking Services
There are multiple staking services on mainnet today with varying degrees of decentralisation, but they all carry an increased risk of correlated failures if you entrust them with your ETH. These services are necessary components of the eth2 ecosystem, especially for those who own less than 32 ETH or lack the technical know-how to stake, but they are architected by humans and are therefore imperfect.
If staking pools eventually grow as large as eth1 mining pools, then it is conceivable that a bug could result in mass slashings or inactivity penalties for their members.
🔗 Infura Failure
Last month, Infura went down for six hours, causing outages across the Ethereum ecosystem; it’s easy to see how this could lead to correlated failures for eth2 validators.
Additionally, third-party eth1 API providers necessarily limit the throughput of calls to their service: in the past, this prevented validators from producing valid blocks (on the Medalla testnet).
The best solution is to run your own eth1 node: you won’t experience throughput throttling, it will reduce the likelihood of your outages being correlated, and it will improve the decentralization of the network as a whole.
Eth2 clients have also started adding the ability to specify multiple eth1 nodes. This makes it easy to fail over to a backup endpoint if your primary endpoint goes down (Lighthouse: --eth1-endpoints, Prysm: PR #8062; Nimbus and Teku will likely add support sometime in the future).
I highly recommend adding backup API options as cheap/free insurance (EthereumNodes.com displays free and paid API endpoints and their current status). This is useful whether or not you run your own eth1 node.
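If you want to sanity-check your endpoints, a minimal health check might look like the sketch below (the endpoint URLs are placeholders; it simply sends the standard web3_clientVersion JSON-RPC request to each one and reports which respond):

```python
# Minimal health check across primary + backup eth1 endpoints.
# URLs are placeholders: substitute your own node and any backup API providers.
import json
import urllib.request

ETH1_ENDPOINTS = [
    "http://localhost:8545",               # your own eth1 node
    "https://backup-provider.example/v1",  # hypothetical backup API endpoint
]

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers a basic JSON-RPC request."""
    payload = json.dumps({
        "jsonrpc": "2.0", "method": "web3_clientVersion", "params": [], "id": 1,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "result" in json.load(resp)
    except Exception:
        return False

for endpoint in ETH1_ENDPOINTS:
    status = "OK" if is_healthy(endpoint) else "DOWN"
    print(f"{endpoint}: {status}")
```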
🦏 Failure of a particular eth2 client
Despite all the code review, audits, and rockstar work, every eth2 client has bugs hiding somewhere. Most of these are minor and will be caught before they become a major problem in production, but there is always a risk that the client you choose goes offline or causes you to be slashed. If that were to happen, you would not want to be running a client used by more than 1/3 of the nodes on the network.
You have to find a trade-off between what you consider to be the best client and how popular that client is. Consider reading another client’s documentation so that, if something happens to your node, you know what to expect when installing and configuring a different client.
If you have a lot of ETH at stake, it is probably worth running multiple clients, each with a portion of your ETH, to avoid putting all your eggs in one basket. Otherwise, Vouch is an interesting offering for multi-node staking infrastructure, and Secret Shared Validators are seeing rapid development.
🦢 Black Swans
There are, of course, many improbable, unpredictable, yet dangerous scenarios that will always pose a risk; scenarios that lie outside the obvious decisions about how to set up your staking. Examples such as Spectre and Meltdown at the hardware level, or kernel bugs such as BleedingTooth, hint at some of the dangers that exist throughout the hardware stack. By definition, it is not possible to predict and entirely avoid these problems; you generally have to react after the fact.
What to worry about
Ultimately, it comes down to calculating the expected value E(X) of a given failure: how likely it is that the event will occur, and what the penalties would be if it did. It is essential to consider these failures in the context of the rest of the eth2 network, since correlation greatly affects the penalties incurred. Comparing the expected cost of a failure against the cost of mitigating it gives you a rational answer as to whether it is worth addressing.
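Putting it all together, a (completely made-up) ranking of failure modes by expected cost might look like the sketch below; note how the correlated failures dominate even when they are far less likely.

```python
# Toy ranking of failure modes by expected cost (probability × penalty).
# Every number here is invented for illustration; substitute your own estimates.

failure_modes = {
    # name:                       (probability per year, penalty in ETH if it happens)
    "home power/internet outage": (0.50, 0.01),   # uncorrelated, cheap
    "hardware failure":           (0.05, 0.015),  # uncorrelated, a few days offline
    "cloud provider outage":      (0.10, 0.30),   # correlated -> inactivity leak
    "duplicate keys / slashing":  (0.01, 32.0),   # correlated worst case
}

for name, (p, penalty) in sorted(failure_modes.items(),
                                 key=lambda kv: kv[1][0] * kv[1][1],
                                 reverse=True):
    print(f"{name:<28} expected cost ~ {p * penalty:.4f} ETH/year")
```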
No one knows all the causes of node failure, nor the probability of each failure, but by individually estimating the chances of each type of failure and mitigating the greatest risks, the “wisdom of the crowd” will prevail and, on average, the network as a whole will make a good estimate. Additionally, due to the different risks each validator faces and the different estimates of those risks, failures that you have not accounted for will be detected by others and the degree of correlation will therefore be reduced. Yes, decentralization!
📕 DON’T PANIC
Finally, if something happens to your node, don’t panic! Even during an inactivity leak, the penalties are small over short periods. Take a few moments to think about what happened and why. Then devise an action plan to resolve the problem. Then take a deep breath before you continue. An extra 5 minutes of penalties is better than getting slashed because you did something unwise in a hurry.
Most importantly: 🚨 Don’t run 2 nodes with the same validator keys! 🚨
Thanks to Danny Ryan, Joseph Schweitzer and Sacha Yves Saint-Leger for their review
(Slashings caused by validators running >1 node – Beaconcha.in)