Splice-validator-participant-1 keeps restarting during reconnect-participants on mainnet (v0.5.18)
We’re running a Canton validator node (v0.5.18) on mainnet and the splice-validator-participant-1 container keeps restarting in a loop.
Observed behavior:
- Participant starts and connects to 13 sequencers successfully
- Gets stuck logging "Task reconnect-participants still not completed"
- CPU spikes to ~785% (nearly maxing all 8 cores), RSS grows to ~4.3 GB
- Participant crashes ~6 minutes after startup with no ERROR-level logs
- Validator loses its gRPC connection to participant:5002 and also restarts
- Cycle repeats
Server spec: 8 vCPU / 16 GB RAM, 100 GB data disk. Memory and disk are not the bottleneck.
Is this expected behavior during the initial ACS sync on mainnet? Does the reconnect-participants task eventually complete after enough retry cycles, or is there something we need to configure to stabilize the participant?
The reconnect-participants task is the participant’s internal task that reestablishes its connections to the synchronizer’s sequencers after a restart. It tends to stall when the participant is under extreme resource pressure during the initial ACS commitment reconciliation. A crash at ~6 minutes with no ERROR-level logs looks like an out-of-memory kill of the JVM process, which would explain why you see nothing in the application logs.
I’d suggest a few things to try:
- Increase the participant JVM heap: add explicit heap flags to the participant container via `_JAVA_OPTIONS` in your Docker Compose file.
- Cap the container’s CPU. The high CPU usage is likely the JVM’s parallel GC threads competing with each other, so capping the container at around 6 cores forces the JVM to use fewer GC threads and often results in a faster overall startup because GC becomes less chaotic.
- Check Postgres I/O. ACS commitment processing is also database-heavy, so if Postgres is on the same 100 GB data disk with standard IOPS, it can become a bottleneck that backs up the participant’s in-memory queues.
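
For the heap and CPU suggestions, a Compose override along these lines might be a starting point. This is only a sketch: the service name `participant` is inferred from the container name and the `participant:5002` address, the heap sizes are illustrative guesses for a 16 GB host, and how the override file gets merged depends on how your deployment invokes Docker Compose.

```yaml
# Hypothetical override fragment. The service name and all sizes are
# assumptions, not values taken from the Splice documentation.
services:
  participant:
    environment:
      # Explicit heap bounds for the participant JVM.
      # This replaces any _JAVA_OPTIONS value already set for the service,
      # so carry over existing flags if there are any.
      _JAVA_OPTIONS: "-Xms4g -Xmx8g"
    # Cap the container at 6 of the 8 vCPUs. A container-aware JVM sizes
    # its GC thread pool from the CPUs it is allowed to use.
    cpus: 6
```

After restarting with the override, it is worth confirming in the participant's startup output that the JVM picked up the options before judging whether the restart loop has stopped.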