CIP-TBD: Canton Ecosystem Status Page & Incident Transparency
Abstract
This CIP proposes the creation of a Canton Ecosystem Status Page and a complementary Slack channel to provide real-time visibility into network health, outages, scheduled maintenance, and incident resolution across the Global Synchronizer and connected infrastructure. The goal is to eliminate redundant back-and-forth communication during technical incidents, increase transparency between Super Validators, Validators, and application builders, and make it possible for ecosystem participants to contribute to incident awareness.
Motivation
As the Canton Network matures and onboards more validators, application providers, and institutional participants, the need for a centralized, authoritative source of network health information has become critical. Today, when outages or degraded performance occur on the Global Synchronizer — whether during planned upgrades (e.g., Major Upgrades with Downtime per CIP-0062, CIP-0089) or unplanned incidents — builders and validators must rely on fragmented communication across mailing lists, Slack threads, and ad-hoc messages to understand what is happening, what is affected, and what the resolution timeline looks like.
This creates several problems:
For Builders: Application developers building on Canton (e.g., trading platforms, settlement systems, custody integrations) cannot easily distinguish between an issue in their own stack and a network-level incident. Without a canonical status page, engineering teams waste hours debugging local infrastructure before discovering that the root cause is upstream.
For Validators: Validator operators, including Node-as-a-Service providers hosting on behalf of institutions, need timely incident data to communicate with their own customers.
For Super Validators: During incidents, Super Validator operators are often inundated with the same questions from multiple parties simultaneously. A canonical status source would dramatically reduce this operational overhead and let SV ops teams focus on resolution rather than communication.
Specification
1. Public Status Page
Deploy a publicly accessible status page (e.g., status.canton.network or status.sync.global) that provides:
Component Monitoring:
- Global Synchronizer (sequencer health, round progression)
- Scan API and Validator API availability
- SV/Validator availability and response times
- DevNet, TestNet, and MainNet environments (displayed separately)
Incident Management:
- Real-time incident creation with severity levels (Operational, Degraded Performance, Partial Outage, Major Outage)
- Timestamped status updates throughout incident lifecycle (Investigating → Identified → Monitoring → Resolved)
- Post-incident summaries with root cause analysis
- Scheduled maintenance windows with advance notice (minimum 48 hours for planned upgrades, consistent with existing Major Upgrade with Downtime procedures)
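The incident lifecycle and severity levels listed above could be modeled roughly as follows. This is a minimal sketch, not a prescribed data model: the class and method names are hypothetical, and the rule that a status may only stay the same or move forward is an assumption (a real platform may allow an incident to return to an earlier stage).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Severity(Enum):
    # Severity levels as named in this CIP.
    OPERATIONAL = "Operational"
    DEGRADED_PERFORMANCE = "Degraded Performance"
    PARTIAL_OUTAGE = "Partial Outage"
    MAJOR_OUTAGE = "Major Outage"


class Status(Enum):
    # Lifecycle stages as named in this CIP, in order.
    INVESTIGATING = 1
    IDENTIFIED = 2
    MONITORING = 3
    RESOLVED = 4


@dataclass
class Incident:
    title: str
    severity: Severity
    status: Status = Status.INVESTIGATING
    updates: list = field(default_factory=list)

    def post_update(self, new_status, message, at=None):
        """Record a timestamped update and advance the lifecycle."""
        # Assumption: the lifecycle never moves backward.
        if new_status.value < self.status.value:
            raise ValueError(
                f"cannot move from {self.status.name} back to {new_status.name}"
            )
        at = at or datetime.now(timezone.utc)
        self.status = new_status
        self.updates.append((at, new_status, message))
```

Each call to `post_update` appends one timestamped entry, so the `updates` list doubles as the public timeline for the incident.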
Subscription Mechanism:
- Email, Slack, webhook, and RSS subscription options for status updates
- Granular subscriptions (e.g., MainNet-only, specific components)
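Granular subscriptions amount to a filter over environment and component. The sketch below shows one possible matching rule; the `Subscription` type, its field names, and the convention that an empty set means "everything" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    channel: str  # e.g. "email", "slack", "webhook", "rss"
    # Assumption: an empty set means "subscribe to all".
    environments: frozenset = frozenset()
    components: frozenset = frozenset()


def matches(sub, environment, component):
    """Return True if an update for this environment/component should be delivered."""
    env_ok = not sub.environments or environment in sub.environments
    comp_ok = not sub.components or component in sub.components
    return env_ok and comp_ok


# A MainNet-only subscriber interested in any component.
mainnet_only = Subscription("email", environments=frozenset({"MainNet"}))
```

With this rule, `mainnet_only` receives a MainNet Scan API update but not a DevNet one.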
2. Slack Channel with Webhook Integration
Create a dedicated Slack channel (e.g., #canton-status) within the Canton Network Slack workspace that:
- Receives automated posts from the status page via webhook whenever an incident is created, updated, or resolved
- Posts scheduled maintenance reminders at 48h, 24h, and 1h before planned windows
- Includes links back to the full status page for detailed information and historical context
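To illustrate the channel behaviour described above, the following sketch computes the 48h/24h/1h reminder times for a maintenance window and builds a message body for a Slack incoming webhook, which accepts a JSON object with a `text` field. The function names and the message wording are hypothetical, and the sketch deliberately stops short of sending anything; a hosted status platform would normally handle delivery itself.

```python
import json
from datetime import datetime, timedelta, timezone

# Reminder offsets taken from this CIP: 48h, 24h, and 1h before the window.
REMINDER_OFFSETS = (timedelta(hours=48), timedelta(hours=24), timedelta(hours=1))


def reminder_times(window_start, now):
    """Return the reminder times that are still in the future, earliest first."""
    return [
        window_start - offset
        for offset in REMINDER_OFFSETS
        if window_start - offset > now
    ]


def slack_payload(title, status, status_page_url):
    """Build the JSON body for a Slack incoming webhook post (hypothetical wording)."""
    text = f"[{status}] {title}\nDetails: {status_page_url}"
    return json.dumps({"text": text})


window = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
now = datetime(2025, 1, 9, 0, 0, tzinfo=timezone.utc)  # 36 hours before the window
pending = reminder_times(window, now)  # the 48h reminder has already passed
```

In this example only the 24h and 1h reminders remain, because the 48h mark is already behind `now`.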
Implementation
Platform
The status page should be deployed using an established status page platform (e.g., Atlassian Statuspage, Instatus, or an open-source solution like Cachet or Upptime) that supports:
- Webhook integrations for Slack and other messaging platforms
- API access for programmatic incident creation and status queries
- Custom domain hosting under a canton.network or sync.global subdomain
- SSO integration for administrator access
Phase 1 (Within 30 days of CIP approval):
- Deploy status page with manual incident management for MainNet
- Create the Slack channel with webhook integration
- Establish incident severity definitions and escalation procedures
Phase 2 (Within 90 days of CIP approval):
- Add DevNet and TestNet environments
- Implement automated health checks for key components (round progression, API availability)
- Enable subscriber notifications (email, webhook, RSS)
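An automated health check for round progression could be as simple as comparing the age of the latest observed round against the expected interval between rounds. The sketch below is one way to do that. The function name, the threshold multipliers, and the mapping onto severity labels are assumptions, and the expected interval is passed in as a parameter rather than hard-coded because this CIP does not specify it.

```python
from datetime import datetime, timedelta, timezone


def round_progression_status(last_round_at, now, expected_interval,
                             degraded_after=2.0, outage_after=6.0):
    """Classify round progression by how many expected intervals have elapsed.

    degraded_after and outage_after are hypothetical multipliers of
    expected_interval, chosen only for illustration.
    """
    if expected_interval <= timedelta(0):
        raise ValueError("expected_interval must be positive")
    # Dividing two timedeltas yields a float ratio.
    lag = (now - last_round_at) / expected_interval
    if lag >= outage_after:
        return "Major Outage"
    if lag >= degraded_after:
        return "Degraded Performance"
    return "Operational"


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
interval = timedelta(minutes=10)  # illustrative value, not taken from this CIP
status = round_progression_status(now - timedelta(minutes=25), now, interval)
```

A check like this would feed a suggested component status to operators in Phase 2; triggering incidents automatically is left to Phase 3.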
Phase 3 (Within 120 days of CIP approval):
- Integrate automated monitoring that can trigger incident creation based on anomaly detection
- Publish a public SLA dashboard with historical uptime metrics
- Evaluate integration with the Canton Scan explorer for correlated visibility
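The historical uptime figure on an SLA dashboard can be derived from recorded outage intervals. The sketch below shows one possible calculation, assuming outages are stored as (start, end) pairs; it clips each outage to the reporting period and merges overlaps so that concurrent incidents are not double counted. The function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def uptime_percent(period_start, period_end, outages):
    """Percentage of the period not covered by any outage interval."""
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    # Keep only outages that intersect the period, clipped to its bounds.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in outages
        if end > period_start and start < period_end
    )
    down = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval before starting a new one.
            if cur_end is not None:
                down += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        down += (cur_end - cur_start).total_seconds()
    return 100.0 * (1.0 - down / total)


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
# Two overlapping outages that together cover 3 hours of a 100-hour period.
overlapping = [
    (t0 + timedelta(hours=10), t0 + timedelta(hours=12)),
    (t0 + timedelta(hours=11), t0 + timedelta(hours=13)),
]
pct = uptime_percent(t0, t0 + timedelta(hours=100), overlapping)
```

Whether scheduled maintenance counts as downtime is a policy decision for the dashboard; this sketch treats every interval it is given as downtime.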
Governance
- The SV Operations Committee (CIP-0060) is responsible for administering the status page and approving incident publications
- Any Super Validator operator can create an incident
This CIP is licensed under CC0-1.0: Creative Commons CC0 1.0 Universal.