Job summary

Staff Software Engineer I - SRE

Requirements and responsibilities

What You Will Do:

  • Proactive Reliability Engineering (~75% of role)
      · Analyze systemic failure patterns and design improvements that prevent incident recurrence
      · Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
      · Build tooling and automation to reduce incident response toil and scale team impact
      · Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
      · Analyze reliability data to identify systemic improvements; build dashboards that drive action
      · Explore AI-assisted approaches to documentation quality and incident analysis
      · Design scalable reliability standards that reduce reactive workload over time
  • Incident Management Program (~25% of role)
      · Own standards, practices, and continuous improvement of incident response
      · Serve as an on-call Incident Commander for production incidents, including acting as escalation IC when incidents exceed a team's management chain
      · Develop and deliver training programs for engineering teams at all levels
      · Coach teams through post-mortems and on developing actionable corrective actions
  • Customer Root Cause Analysis (CRCA)
      · Edit and review customer-facing incident documents to ensure quality and clarity
      · Drive turnaround SLAs while maintaining technical accuracy
      · Ensure clear explanation of what happened, why, and how we'll prevent recurrence
  • Cross-Team Leadership
      · Partner with engineering leaders to elevate reliability practices
      · Be the expert who teams proactively engage for guidance

What You Will Bring:

  • 10+ years in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)
  • Strong understanding of distributed systems and failure modes at scale; Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
  • Deep experience with observability (metrics, logging, tracing) and the ability to diagnose complex issues
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Systems thinking: understanding how infrastructure design choices affect failure modes and recovery
  • Familiarity with SLO/SLA frameworks
  • Track record as a trusted advisor across engineering organizations
  • Experience driving org-wide process and cultural changes
  • Strong written communication (design docs, one-pagers, runbooks)
  • Post-mortem facilitation experience
  • Experience with async collaboration across time zones
  • Large company experience navigating reliability/incident programs at 500+ engineer organizations

What Gives You an Edge:

  • Multi-cloud experience (at least two of AWS, GCP, and Azure)
  • Modern CI/CD, GitHub, and AI-assisted workflows: you'll have the freedom to build what you need

Focus: Engineering (job area)
Seniority signal: Lead (candidate level)
Stack: AWS, Azure, CI/CD (core skills)
Location: 1 country accepted (eligibility)
