Skip to content

CIP-0064: Delegateless Automation

FinalStandards Trackby Julien TinguelyCreated 21-04-2025Approved 20-06-2025
TL;DR

CIP-0064

Abstract

This CIP proposes to replace the elected SV delegate mechanism, to improve fault tolerance and more fully automate decentralized operations.
In the current approach, the Decentralized Synchronizer Operations (DSO) party, which represents the consensus of the Super Validators, delegates actions to a single Super Validator (SV) party known as the delegate. The delegate then submits those actions to the ledger.

In the proposed approach, the DSO party authorizes all the SV parties to submit the designated actions. The automation component of the SV application on each Super Validator node then automatically attempts to submit these delegated actions. Each application applies a random delay, tuned to minimize contention.

Specification

Daml

In the new implementation, all current choices on the DsoRules template having dsoDelegate as controller will introduce a new optional argument named sv to be used instead. This allows all SVs to exercise the choice by specifying their own party. Daml will reject, with an authorization error, any transaction where the SV application tries to specify any party other than its own.

SV Application

The SV application is modified such that all SVs attempt to submit the commands currently only being submitted by the delegate.

This introduces some potential for conflict: if multiple SV applications submit the same transaction at the same time-- for example, if all attempt to advance the mining rounds at the same time--only one of them will succeed, and the others will fail; that is, they will contend with each other. Importantly, this is not a correctness issue: The Daml code already handles this case; for example, it can successfully manage if the delegate crashes during a submission and tries to resubmit the same command. But the failed transactions would introduce unnecessary load on the network.

To minimize contention, each instance of the decentralized SV application introduces a random delay before attempting a task, and uses a retry mechanism with randomized backoff. To ensure that these delays are evenly distributed, scalable with the number of SVs, and remain within the polling interval, the delays are sampled from a uniform distribution: Uniform(0, max(n * expectedTaskDuration, pollingInterval)), where:

  • n is the number of SVs,
  • expectedTaskDuration is the estimated duration of a command execution (default: 5 seconds), and
  • pollingInterval is the interval between polling trigger calls (default: 30 seconds).

The retry mechanism uses exponential backoff (the delay doubles each time) with an initial delay of expectedTaskDuration, with a maximum backoff delay capped at 2 * expectedTaskDuration. The expectedTaskDuration value is configurable via the Helm chart, allowing adjustments if needed.

After the delay has passed, the SV application either succeeds submitting the command, fails because of contention or realises the command has already been submitted by another SV. Each automation trigger already includes a staleness check to determine if the associated task has already been completed or archived.

A new metric named splice_trigger_attempted_total is added to monitor trigger failures caused by contention and help in choosing good delay and backoff parameters.

SV UI

In the current implementation, there is a dedicated page in the SV UI used to elect a new delegate for the next round, intended to unlock the system if a delegate goes down. In the new implementation, delegate election and its dedicated page in the SV UI are fully removed and associated Daml choices cannot be executed anymore.

Motivation

A delegate performs critical network-related tasks, such as for example advancing mining rounds or triggering vote proposal executions when the required 2/3 voting thresholds is reached. In the current version, there is no automatic fault tolerance to mitigate issues that may occur on the node. SVs can vote on a new delegate, but this is a manual process, and the single point of failure is just moved to a different SV. A concrete problem is rounds stop progressing when the delegate is down. This has been observed multiple times on devnet and testnet and highlights the need for a better fault tolerance mechanism.

Rationale

In the proposed solution, all SV application instances cooperate to provide good fault tolerance guarantees as they attempt to complete tasks. The proposed solution introduces no correctness issues, as commands are already all idempotent.

Given that the current delegate-based choices do not require high throughput, e.g., round advancement happens only every 10 minutes, random delays and backoffs are sufficient to minimize contention.

Our approach does not degrade performance in practice, as it introduces only minimal overhead in SV app automation while significantly enhancing fault tolerance. Most tasks succeed on the first attempt, rendering the backoff mechanism largely inactive. Empirical benchmarks show that delegate transactions accounted for approximately 1% of total transactions on DevNet and 1.6% on MainNet, based on 24-hour sampling periods. The script used for this analysis is available in Splice under delegate_txs.py.

While more complex approaches to minimize contention would be possible, e.g., defining a clear order in which SVs should attempt to process a given task that is determined based on some task identifier, the implementation complexity of that does not seem to be justified.

Risks and mitigations

If for some reason an automation trigger ends up overloading with too much contention and causing cascading failures, the SV application allows pausing individual triggers. It can also be specified using the ADDITIONAL_CONFIG* environment variable.

Backwards compatibility

This new approach is only active when all SV operators have vetted the new governance dar file. All choices and backend code involved are backward compatible.

Reference implementation

The new implementation includes backend and Daml changes. An early version of the code is available in Splice branch: feature/delegateless-automation.

Copyright

This CIP is licensed under CC0-1.0: Creative Commons CC0 1.0 Universal

Changelog

  • 2025-04-21: Initial draft of the proposal.
  • 2025-06-03: Approved CIP