Zen Crypted Reliability Engineer
Statement of Work — Service Reliability Engineer (SRE)
Project: Ensuring 24/7 availability, resilience and operational excellence of the secure military-grade instant messaging platform
Position: Senior Service Reliability Engineer (SRE)
Project context: Zen Crypted is building x509.chat, a defense/government-grade secure communications ecosystem. The platform is powered by a custom ASN.1/DER-encoded protocol over TCP/QUIC with X.509 CMS envelope encryption, full PKI validation, ephemeral messaging and strict compliance with RFC 5280, RFC 5652, RFC 8551, ДСТУ 4145 and applicable military standards. The backend is implemented in Elixir/Erlang/OTP with Mnesia persistence. The SRE role is critical for keeping this high-stakes system running at the extreme reliability levels required by defense and government customers.
Scope of Work (main deliverables):
Monitoring, Observability & Alerting:
- Design and implement a comprehensive observability stack (Prometheus + Telemetry + Grafana + OpenTelemetry) for the Elixir/Erlang backend and QUIC transport layer
- Build custom metrics, traces and logs for ASN.1 protocol events, crypto operations, message delivery latency and PKI validation flows
- Establish SLOs/SLIs/SLAs tailored to military-grade requirements (99.99%+ uptime, sub-second delivery, zero-downtime certificate rotation)
- Create intelligent alerting and on-call escalation workflows with PagerDuty-style rotation
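To make the availability target in this list concrete, the downtime an SLO permits can be computed directly. The sketch below is illustrative only: the function names and the 30-day and 365-day windows are assumptions chosen for the example, not terms of this Statement of Work. For a 99.99% target it works out to roughly 259 seconds (about 4.3 minutes) per 30 days, or about 52.6 minutes per year.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Return the total downtime an availability SLO permits over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return timedelta(seconds=window.total_seconds() * (1.0 - slo))


def budget_consumed(slo: float, window: timedelta, downtime: timedelta) -> float:
    """Return the fraction of the error budget used (1.0 means fully spent)."""
    budget = allowed_downtime(slo, window)
    if budget.total_seconds() == 0:
        # A 100% SLO leaves no budget: any downtime at all exceeds it.
        return float("inf") if downtime.total_seconds() > 0 else 0.0
    return downtime / budget


if __name__ == "__main__":
    # Hypothetical windows; the 99.99% figure is the target named above.
    for days in (30, 365):
        budget = allowed_downtime(0.9999, timedelta(days=days))
        print(f"{days:>3} days at 99.99%: {budget.total_seconds():.1f} s allowed")
```

A calculation like this is typically the basis for burn-rate alerts: paging when the budget is being consumed much faster than the window allows, rather than on every individual error.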
Reliability Engineering & Chaos Resilience:
- Implement automated chaos engineering and failure-injection testing for critical paths (network partitions, crypto module failures, Mnesia overload, QUIC connection drops)
- Develop automated remediation playbooks and self-healing mechanisms (supervision tree tuning, circuit breakers, rate limiting)
- Capacity planning and horizontal scaling strategies for high-load defense scenarios
- Post-incident reviews (blameless post-mortems) and reliability improvement backlog
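As an illustration of the circuit-breaker pattern named among the self-healing mechanisms above, here is a minimal, language-neutral sketch in Python. It is not the platform's implementation (the backend is Elixir/OTP, where this would normally sit alongside the supervision tree); the class name, thresholds and injectable clock are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down period has passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a probe call
        self._clock = clock                # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half_open"             # one probe call is allowed through
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("call rejected: circuit is open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the breaker.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a dependency such as an OCSP responder lookup in `breaker.call(...)` lets callers fail fast while the dependency is down, instead of piling up timeouts behind it.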
Infrastructure & Release Reliability:
- Own and harden CI/CD pipelines, the mix release process, Alpine init-supervised packaging and zero-downtime deployment strategies
- Infrastructure-as-Code and GitOps practices for all production environments
- Secure distribution and secret management compliant with defense standards
- Performance benchmarking and load testing under realistic military network conditions
Security Operations & Compliance:
- Integrate security telemetry with crypto audit logs and OCSP/CRL monitoring
- Support external security audits, pentests and compliance certifications
- Maintain a FIPS-like operational mode and side-channel resistance at the infrastructure level
- Disaster recovery and business continuity planning for classified environments
Collaboration & Knowledge Transfer:
- Work closely with Backend Engineers and the Product Architect to embed reliability practices into design and code reviews
- Document runbooks, operational procedures and reliability architecture
- Train the team on SRE principles and golden signals
Required skills & experience (for job/CV screening):
- 5+ years of commercial SRE or DevOps experience in high-availability, security-critical systems
- Strong production experience with Elixir/Erlang/OTP (BEAM VM tuning, Mnesia/DETS, supervision trees, Telemetry)
- Deep hands-on knowledge of observability tools (Prometheus, Grafana, OpenTelemetry, ELK or equivalent)
- Practical experience with QUIC, TCP networking, Docker, CI/CD and infrastructure automation
- Understanding of cryptography in production (X.509, CMS, OCSP, PKI) and secure protocol operations
- Experience in defense, government or high-security environments (zero-trust, audit logging, compliance with RFCs / ДСТУ)
- Proficiency in Linux systems, networking and performance analysis
- English (Upper-Intermediate) + Ukrainian (advantage)
- Master’s degree or higher in Computer Science, Mathematics or related field
Nice to have:
- Familiarity with the N2O.DEV / ERP.UNO open-source stack
- Experience with post-quantum cryptography operations or MLS (Messaging Layer Security)
- Background in formal verification or high-assurance systems
- Previous work on military-grade chat/messaging platforms
Estimated engagement & Success criteria:
- Estimated engagement: Full-time / 6–12 months initial contract with extension option
- Success criteria:
- SLOs defined and met, staying within the 0.01% error budget implied by the 99.99% availability target
- Zero unplanned downtime during security audits and field trials
- Automated on-call paging with response time <15 minutes, and MTTR <30 minutes
- Comprehensive observability coverage and living reliability documentation
- Successful hand-off of production operations to internal team