
[Infrastructure] Lessons Learned from Solving SSV Performance Issues

August 23, 2024
Disclaimer: A41 is a crypto infrastructure company that provides node operation across various blockchains. It primarily provides services using bare-metal servers at Tier III & Tier IV data centers in South Korea. The data used in this article is based on information collected from A41 validator nodes, and results may vary depending on the operating environment.

TL;DR

  • We examine the consensus process among operators in the SSV network and identify key factors that may impact performance.
  • By comparing operator performance across different environments—such as Korean data centers, AWS instances in Korea, and OVH in Germany—we investigate the performance issues faced by SSV operators in Korean data centers and explore their underlying causes.
  • We examine the rationale for restricting P2P communication within the APAC region's IP range under the current circumstances. Additionally, we highlight why, in the long term, achieving a more geographically balanced distribution of SSV operators is crucial for fundamentally resolving these issues.

1. SSV Consensus

Source: https://ssv.network/blog/technology/ssv-protocol-implementation-deep-dive/

When participating in Ethereum consensus through the SSV network, the process follows the steps shown in the diagram and is divided into three main phases: Pre-consensus, Consensus, and Post-consensus.

  1. Pre-consensus: Blue Steps, Fetch Duty Data
  2. Consensus: Yellow Steps, SSV Consensus
  3. Post-consensus: Orange Steps, Aggregate & Submit

The main focus should be on the Yellow Steps. SSV relies on a BFT (Byzantine Fault Tolerant) algorithm to reach consensus among operators, following the Proposal-Prepare-Commit process.
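To make the fault-tolerance arithmetic concrete, here is a minimal sketch of the standard BFT sizing rule, assuming a committee of n = 3f + 1 operators that needs 2f + 1 matching messages to advance the Prepare and Commit phases. The function names are ours for illustration and are not taken from the SSV codebase.

```python
def max_faulty(n: int) -> int:
    """Return f for a committee of size n = 3f + 1."""
    if n < 4 or n % 3 != 1:
        raise ValueError("this sketch only handles committee sizes of the form 3f + 1")
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Return the number of matching messages (2f + 1) needed to advance a phase."""
    return 2 * max_faulty(n) + 1


if __name__ == "__main__":
    for n in (4, 7, 10, 13):
        print(f"n={n}: tolerates f={max_faulty(n)}, quorum={quorum_size(n)}")
```

With four operators the quorum is three, so each phase can only finish once the messages of the third-fastest operator have arrived. This is why the latency to the slower peers matters more than the latency to the fastest one.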

In this process, not all operators participating in the SSV network agree on a single state. The subnet is determined based on the Ethereum validator that each operator manages, and consensus is reached among operators subscribed to that specific subnet. Note that the number of subnets is limited to a maximum of 128. Once an operator's Ethereum validator is determined, the required subnet for subscription is also decided accordingly.
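The relationship between validators and subnets can be sketched as follows. This is an illustration only: we assume the subnet is derived deterministically by reducing a prefix of the validator public key modulo the subnet count, and the actual mapping in the SSV client may differ. `validator_subnet` and `operator_subnets` are hypothetical helper names.

```python
SUBNET_COUNT = 128  # maximum number of subnets mentioned above


def validator_subnet(pubkey_hex: str, subnet_count: int = SUBNET_COUNT) -> int:
    """Map a validator public key (hex string) to a subnet index.

    Assumption: the first 10 hex characters are read as an integer and
    reduced modulo the subnet count.
    """
    key = pubkey_hex.lower()
    if key.startswith("0x"):
        key = key[2:]
    if len(key) < 10:
        raise ValueError("public key is too short")
    return int(key[:10], 16) % subnet_count


def operator_subnets(pubkeys) -> set:
    """Return the set of subnets an operator must subscribe to for its validators."""
    return {validator_subnet(pk) for pk in pubkeys}
```

Under any mapping of this kind, an operator managing many validators ends up subscribed to many subnets, each with its own peer set.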

Source: https://ssv.network/blog/technology/introduction-to-the-first-ssv-network-fork/

Therefore, the performance of an operator can vary depending on the state of the peers that make up the subnet. In other words, even if an operator is running multiple Ethereum validators, the consensus performance may differ because each validator belongs to a different subnet.

2. The Problem We Faced

Among the SSV operators we manage, four are operating in Korean data centers. These four operators form a single SSV cluster and manage 126 Ethereum validator keys. For duties such as attestation, the consensus process described above implies that consensus in the SSV network should be completed within 8 seconds. To achieve this, the machine resources on which each operator runs must not become a bottleneck, and the network latency between peers within the subnet should be minimal.

The graph above shows the average attestation success rate data for the four operators running in Korea. Overall, the performance is unstable, and the values fluctuate significantly depending on the situation. We aimed to identify the key bottleneck by systematically eliminating variables that could affect performance.

First, we examined the possibility of insufficient computational resources. In the current setup, a single bare-metal machine with 32 cores and 128 GB RAM runs the Ethereum client and four SSV operators. Based on the constant alerts and error logs, we identified a possible resource bottleneck in the Ethereum consensus layer client. To address this, we added another machine with the same specifications and distributed the clients. However, performance remained largely unchanged, and given that the machine’s capabilities are far beyond what is required for running the SSV client, we concluded that this was not the main issue.

Next, we considered the potential issues related to running operators in the Korean data center. The main focus was on improving network latency during SSV peer communication. We undertook several direct and indirect measures to address this and will detail each approach in the following sections.

3. The Attempts

3.1. Comparison by Operating Environment

First, we wanted to assess how running SSV operators in different environments would affect actual performance. The main points we aimed to verify are as follows:

  1. Is performance impacted by network latency due to most peers being geographically distant from Korea?
  2. Is the available bandwidth for outbound international traffic at the data center a bottleneck?

To test the first point, we ran an operator in the EU region (where many peers are located) and collected data. For the second point, we ran an operator on an instance in the AWS Seoul region, which has a high SLA level for outbound international traffic bandwidth, and collected data. The results are summarized in the following graph.

Even within the same region in Korea, there were significant performance differences depending on the resources of the operating environment. Additionally, the regional difference between Korea and Germany was also evident. The key takeaway from these results is that "the outbound international bandwidth resources of the data center are a major bottleneck, and performance is also influenced by the geographic location of the peer group." One crucial point to consider is that data centers in Korea primarily operate domestic services, so they are not optimized for outbound international traffic, and the SLA contracts between ISPs and data centers are generally at a relatively low level. Considering all these factors, we aim to find appropriate solutions by adjusting the SSV client configuration or making changes to the operating environment.
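For readers who want a rough first check of the latency hypothesis from their own environment, the sketch below times TCP handshakes to a peer's address. It is an illustration and not the tooling used for the measurements in this article. It only captures connection setup latency, so it says nothing about bandwidth limits, and the host and port passed in are whatever peer address the reader chooses.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Return the time in milliseconds taken to complete one TCP handshake."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the connection is closed as soon as it is established
    return (time.perf_counter() - start) * 1000.0


def median_connect_ms(host: str, port: int, samples: int = 5, timeout: float = 3.0) -> float:
    """Return the median handshake time over several attempts."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    return statistics.median(tcp_connect_ms(host, port, timeout) for _ in range(samples))
```

Comparing the median for a peer in the EU with the median for a peer in APAC, from both the data center and a cloud instance, gives a quick indication of whether distance or the uplink dominates.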

3.2. Controlling Peers Count

In SSV Client version 1.3.7, there are two ways to adjust the number of peers:

  1. Overall peer limit (default: 60)
  2. Per-subnet peer limit (default: 10)

Increasing the number of peers is expected to improve consensus performance by reducing the RTT (Round Trip Time) required for SSV consensus. Having more peers allows for communication with operators in a wider range of regions, increasing the likelihood of connecting with peers geographically closer to Korea.
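The intuition can be put into numbers with a simple model. Assuming each peer slot is filled independently and a fixed fraction p of the network's operators are nearby, the chance of having at least one nearby peer among k peers is 1 - (1 - p)^k. The 10% fraction in the usage example is a hypothetical value chosen for illustration and is not measured data.

```python
def p_at_least_one_nearby(nearby_fraction: float, peer_count: int) -> float:
    """Probability that at least one of peer_count peers is nearby.

    Assumes peers are drawn independently, each being nearby with
    probability nearby_fraction.
    """
    if not 0.0 <= nearby_fraction <= 1.0:
        raise ValueError("nearby_fraction must be between 0 and 1")
    if peer_count < 0:
        raise ValueError("peer_count must not be negative")
    return 1.0 - (1.0 - nearby_fraction) ** peer_count


if __name__ == "__main__":
    # Per-subnet limits of 10 and 20, with a hypothetical 10% nearby fraction.
    for k in (10, 20):
        print(k, round(p_at_least_one_nearby(0.10, k), 3))
```

The model also shows the limitation of this approach: when the nearby fraction is small, most of the added peers are still distant, so bandwidth use grows faster than the benefit.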

Since SSV consensus happens on a subnet-by-subnet basis, it is important to increase not only the overall peer limit but also the per-subnet peer limit. Therefore, we increased the overall peer limit to 200 and the per-subnet peer limit to 20, then monitored the performance trends over several days.

After increasing the number of peers, the consensus time occasionally dropped to the 3-4 second range, but in most cases it remained in the 8-9 second range. This indicates that while there are instances of faster consensus depending on the peer group, the majority of peers are still geographically distant, and the data center’s outbound international traffic bandwidth is not optimized. As a result, there was no significant improvement in the average attestation success rate, and bandwidth usage at the data center increased by approximately 30%.

3.3. Using AWS Instance as Relay

Maximizing the use of data center machines while achieving optimal consensus performance proved difficult through simple SSV client configuration adjustments. The main issue was the insufficient availability of outbound international traffic bandwidth resources at the data center. To address this, we decided to use AWS instances with a higher level of SLA.

For this setup, it is crucial to ensure peer connections between the operators running on AWS instances and those operating in the data center. Utilizing the recently released TrustedPeers setting in SSV Client version 1.3.8, it is now possible to prioritize connections to specific peers. The new SSV client configuration and structure we are testing are as follows.

# ssv/config.yaml
p2p:
  # ...
  TrustedPeers:
    - /ip4//tcp//p2p/
    - /ip4//tcp//p2p/
    - /ip4//tcp//p2p/
    - /ip4//tcp//p2p/

The interconnection of the four operators is prioritized using the TrustedPeers setting. To address the traffic bottleneck caused by the limited outbound international bandwidth at the data center, we leveraged the AWS instance’s bandwidth as a relay.

The graph above shows the operator data after running the new setup for several hours. Initially, for the first three hours, the SSV consensus time was less than 1 second, achieving over 98% performance. However, as the number of peers reached the limit and time passed, the SSV client’s PubSubScoring mechanism started operating, continuously updating the peer group to optimize the configuration. At this point, peers specified in TrustedPeers are not guaranteed to always remain in the peer group and may be excluded. If this happens, the use of the AWS instance as a relay becomes irrelevant, leading to a significant increase in consensus time. Ultimately, even when using the AWS instance as a relay, without settings like StaticPeers, there is a limit to how much performance can be improved.
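Because trusted peers can silently drop out of the peer group, it helps to check periodically which of them are still connected. The sketch below extracts the peer ID from each TrustedPeers entry, assuming the usual libp2p multiaddr layout `/ip4/<ip>/tcp/<port>/p2p/<peer-id>`, and compares it with a list of currently connected peer IDs. How that list is obtained (metrics, logs, or otherwise) is left open here, and both function names are hypothetical.

```python
def peer_id_from_multiaddr(addr: str) -> str:
    """Return the component following '/p2p/' in a multiaddr string."""
    parts = addr.strip().split("/")
    try:
        peer_id = parts[parts.index("p2p") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no /p2p/ component in {addr!r}") from None
    if not peer_id:
        raise ValueError(f"empty peer ID in {addr!r}")
    return peer_id


def missing_trusted_peers(trusted_addrs, connected_peer_ids) -> list:
    """Return the trusted peer IDs that are absent from the connected set."""
    connected = set(connected_peer_ids)
    return [
        peer_id
        for peer_id in map(peer_id_from_multiaddr, trusted_addrs)
        if peer_id not in connected
    ]
```

A non-empty result would signal that the relay path is no longer in use, which in our case coincided with the rise in consensus time.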

3.4. Accepting Only APAC Peers

The core issue we are facing is “reducing the network latency required for consensus when the international bandwidth of Korean data centers is not optimized.” When we relate this issue to the composition of the SSV network, the majority of operators are located in the EU and US regions, so the peer group consists mostly of operators far from Korea, which results in inefficient routing.

Ideally, in the SSV client’s PubSubScoring mechanism, the peer group should be configured with fast network response times as the key criterion. However, this would require modifications to the libp2p library and the client itself, which poses challenges for immediate implementation. Therefore, we have implemented firewall policies that ensure the peer group is composed of operators located geographically close to Korea within the APAC region. This approach allows us to use the data center’s outbound international bandwidth resources more efficiently.
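The decision logic behind such a policy can be sketched as a CIDR allowlist check. This is an illustration of the idea and not our firewall configuration: the actual enforcement happens at the firewall, and the ranges below are reserved documentation prefixes standing in for a real list of APAC address blocks.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not real APAC allocations.
ALLOWED_CIDRS = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]


def build_allowlist(cidrs):
    """Parse CIDR strings into network objects once, up front."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_allowed(ip: str, allowlist) -> bool:
    """Return True if the address falls inside any allowlisted network."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr.version == net.version and addr in net
        for net in allowlist
    )


if __name__ == "__main__":
    allowlist = build_allowlist(ALLOWED_CIDRS)
    for ip in ("192.0.2.55", "203.0.113.9", "2001:db8::1"):
        print(ip, is_allowed(ip, allowlist))
```

Filtering at the network layer leaves the SSV client unmodified, at the cost of maintaining the address list as regional allocations change.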

The SSV consensus time has decreased to the millisecond range, and the attestation success rate has also improved to around 98%. Another noteworthy observation is that the number of operator peers remains in the mid-30s, well below the peer limit of 60. This indicates that the number of operators operating within the APAC region is relatively low. Moreover, since the SSV client’s TrustedPeers settings are in place and the peer limit hasn’t been reached, peering between operators running on the same machine is maintained, resulting in shorter consensus times. The comparison between the initial data center setup, the current setup, and the setup used in Germany is as follows.

By focusing on peering with operators in the APAC region and prioritizing peering between operators located on the same machine, we achieved significant improvements compared to the initial configuration, reaching a level comparable to the performance of most EU operators.

4. Conclusion

As we operate as a validator across various blockchains, we frequently encounter technical challenges due to the majority of operators being concentrated in the EU or US regions. For instance, when participating in Ethereum’s Sync Committee, performance can suffer if most validators are geographically distant. Additionally, for blockchains like Sei or Injective, which have block times of less than one second, it often becomes necessary to run nodes in the EU region to maintain optimal performance as a validator.

Nevertheless, as highlighted in this article, A41 prioritizes operating validators in the APAC region and is actively working toward regional decentralization. Despite utilizing Tier III and Tier IV data centers in Korea, when transmission, validation, and consensus need to be completed quickly, the limited number of peers in the APAC region can lead to issues with outbound international traffic from these data centers.

SSV is one of the Ethereum DVT implementations that upholds the philosophy and values of Ethereum decentralization. For the SSV network to continue maturing, more operators need to be active in the APAC region, and configuration options like StaticPeers should be supported. In collaboration with SSV, A41 remains committed to operating validators in the APAC region, conducting experiments, and exploring solutions to contribute to regional decentralization and the advancement of a stronger Ethereum and DVT ecosystem.

5. Appendix

5.1. Performance Comparison Table

IDC in Korea with All Peers

IDC in Korea with APAC Peers

AWS in Korea

OVH in Germany
