Some services are down
Resolved
Jan 04 at 07:40pm CET
Incident Postmortem: Domain Accessibility Outage
Date: January 4, 2025
Duration: 29 minutes (07:01pm - 07:30pm CET)
Severity: High - Complete service unavailability
Affected Services: Website, API
Summary
On January 4th at 07:01pm CET, all services became inaccessible through our primary domain due to a configuration error introduced during routine maintenance. While the application backend remained operational throughout the incident, users were unable to access any services via the domain. Full service was restored at 07:30pm CET.
Timeline (CET)
- 07:01pm - Incident detected: Website and API reported as inaccessible
- 07:01pm - Incident created, investigation begun
- 07:20pm - Root cause identified: Configuration changes affecting domain routing
- 07:20pm - Status updated: Confirmed application running, issue isolated to domain access layer
- 07:30pm - Configuration corrected, services restored
- 07:30pm - Incident resolved, accessibility confirmed
Root Cause
During maintenance to fix access to the Grafana monitoring dashboard, configuration changes were applied to the routing layer. These changes unintentionally altered the routing rules for the primary application domain, blocking all external access to services while the application itself continued running normally.
Impact
- User Impact: All users unable to access website and API services for 29 minutes
- Data Impact: None - no data loss or corruption occurred
- System Impact: Application backend remained operational throughout
Resolution
The faulty configuration changes were identified and corrected, restoring proper domain routing to all services. Access was verified and confirmed operational at 07:30pm CET.
Action Items
Immediate
- Services restored to full operation
Short-term
- Review and document proper configuration change procedures for routing rules
- Implement configuration validation checks before applying changes
- Establish rollback procedures for routing configuration changes
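One way the rollback item above could look in practice is sketched below. This is a minimal illustration, not our actual tooling: the routing configuration shape (hostname mapped to a backend target), the function names, and the `apply` and `is_healthy` callbacks are all hypothetical. The idea is to keep a copy of the known-good configuration, apply the change, and restore the copy if a health check fails.

```python
from typing import Callable, Dict

# Hypothetical shape: hostname -> backend target.
RoutingConfig = Dict[str, str]


def apply_with_rollback(
    current: RoutingConfig,
    proposed: RoutingConfig,
    apply: Callable[[RoutingConfig], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Apply `proposed`; restore `current` if the health check fails.

    Returns True if the new configuration was kept, False if it was rolled back.
    """
    snapshot = dict(current)  # copy of the known-good configuration
    apply(proposed)
    if is_healthy():
        return True
    apply(snapshot)  # health check failed: restore the known-good configuration
    return False
```

In this incident the backend stayed up while the domain was unreachable, so a health check used here would need to probe the public domain rather than the backend directly.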
Long-term
- Consider implementing a staging environment for testing configuration changes
- Add automated monitoring alerts for domain accessibility issues
- Create runbook for domain routing troubleshooting
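As a rough sketch of the monitoring item above, the check below probes two URLs and separates "domain unreachable while backend is up" (the situation in this incident) from a backend failure. The function names, the `Status` values, and the idea of a separate internal backend URL are assumptions made for illustration only.

```python
import urllib.error
import urllib.request
from enum import Enum


class Status(Enum):
    OK = "ok"
    DOMAIN_UNREACHABLE = "domain_unreachable"  # backend up, public route broken
    BACKEND_DOWN = "backend_down"


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx/3xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, connection failures, timeouts, and malformed URLs.
        return False


def classify(domain_ok: bool, backend_ok: bool) -> Status:
    """Combine the public-domain probe and the direct backend probe."""
    if domain_ok:
        return Status.OK
    return Status.DOMAIN_UNREACHABLE if backend_ok else Status.BACKEND_DOWN
```

A scheduler could call `classify(probe(public_url), probe(internal_url))` every minute and alert on anything other than `Status.OK`. Both URLs are placeholders to be supplied by the operator.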
Lessons Learned
What Went Well
- Issue was detected immediately
- Root cause was identified within 20 minutes
- Clear communication maintained via status page
- Application stability was maintained throughout
What Could Be Improved
- Configuration changes should be tested in isolation before production deployment
- Need better safeguards to prevent routing changes from affecting multiple services simultaneously
- Pre-deployment validation could have caught the misconfiguration
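A pre-deployment check of the kind described above might compare the routing rules before and after a change and flag any host that was touched outside the intended scope. The sketch below assumes a simplified configuration shape (hostname mapped to a backend target). The hostnames and function names are hypothetical and do not reflect our real routing layer.

```python
from typing import Dict, Set

# Hypothetical shape: hostname -> backend target.
RoutingConfig = Dict[str, str]


def changed_hosts(before: RoutingConfig, after: RoutingConfig) -> Set[str]:
    """Hosts whose rule was added, removed, or modified."""
    all_hosts = before.keys() | after.keys()
    return {h for h in all_hosts if before.get(h) != after.get(h)}


def out_of_scope_changes(
    before: RoutingConfig, after: RoutingConfig, intended: Set[str]
) -> Set[str]:
    """Hosts that changed even though the change was not meant to touch them."""
    return changed_hosts(before, after) - intended


# Example: a change meant only for the Grafana host also drops the app rule.
before = {"app.example.com": "app:8080", "grafana.example.com": "grafana:3000"}
after = {"grafana.example.com": "grafana:3001"}
unexpected = out_of_scope_changes(before, after, {"grafana.example.com"})
```

Here `unexpected` contains `app.example.com`, so the change would be rejected before reaching production.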
Affected services
Updated
Jan 04 at 07:30pm CET
All services have been restored and are now fully accessible through our domain. The DNS/routing issue has been resolved and connectivity has been confirmed.
Updated
Jan 04 at 07:20pm CET
We are currently experiencing connectivity issues with our domain. The application itself is running normally, but users are unable to access services through our primary domain name.
Created
Jan 04 at 07:01pm CET
We are currently experiencing an outage affecting both our website and API services. Our team has been notified and is actively investigating the root cause.