Job summary

Manager, Cloud Platform & Site Reliability

Requirements and responsibilities


Details

  • Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering orgs, building a culture of ownership, technical excellence, and continuous improvement.
  • Set the technical direction and roadmap for infrastructure, reliability, and platform engineering at the org level — balancing near-term operational needs with long-term strategic investments.
  • Own the reliability posture of the platform end-to-end, establishing and enforcing org-wide standards for SLOs/SLIs, incident response, observability-as-code, runbooks, and post-incident reviews.
  • Drive cross-functional collaboration with product, engineering, and customer-facing teams to ensure infrastructure capabilities and reliability investments align with product goals and enterprise customer requirements.
  • Oversee incident management and escalation processes for high-severity production issues, ensuring clear communication, rapid resolution, and systemic follow-through.
  • Translate recurring operational pain points and customer feedback into roadmap priorities, product improvements, and runbook enhancements across both teams.
  • Ensure best practices for CI/CD, infrastructure-as-code, GitOps, Kubernetes, and cloud resource management are consistently adopted and maintained across the org.
  • Partner with forward-deployed and customer success teams to support enterprise accounts with strict SLAs and complex infrastructure requirements.
  • Navigate ambiguity and make sound architectural and organizational tradeoffs, avoiding unnecessary complexity while enabling your teams to move fast.
  • Demonstrate accountability, pride of ownership, and high standards — and expect the same from your leads and their teams.
  • Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.
  • Proven experience managing managers and leading multiple high-performing infrastructure, platform, or SRE teams in a fast-paced, high-growth environment.
  • Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE, or similar), cloud infrastructure, and distributed systems, with the ability to engage credibly in architectural and operational decisions.
  • Hands-on background with infrastructure-as-code (e.g., Terraform, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, Jenkins); familiarity with GitOps workflows (e.g., Flux CD, Argo CD, Helm).
  • Strong foundation in observability tooling — metrics (Prometheus, VictoriaMetrics), logging (Loki, ELK), dashboards (Grafana), tracing (OpenTelemetry) — and a track record of raising reliability standards through SLOs, SLIs, and observability-as-code.
  • Experience owning incident management and enterprise SLAs at scale, including executive-level communication during high-severity incidents and rigorous post-incident follow-through.
  • Demonstrated ability to lead complex, multi-stakeholder technical initiatives from scoping through execution, balancing engineering excellence with pragmatic delivery.
  • Strong communication skills with executive presence, capable of representing technical work clearly to both technical and non-technical audiences.
  • No prior machine learning experience is required, but candidates should be open to learning about ML infrastructure and model serving.
  • Familiarity with running high-performance AI models and workloads, including troubleshooting ML pipelines from preprocessing through inference and serving.
  • Experience with GPU infrastructure, including fractional GPU provisioning and multi-node model serving (e.g., on H100s or B200s).
  • Experience with incident management platforms (e.g., incident.io, PagerDuty) and building AI-assisted tooling for incident triage and response.
  • Experience scaling an SRE practice: defining runbook standards, building self-healing automations, and converting high-frequency failure patterns into systematic mitigations.
  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Flexible PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
  • Paid parental leave.
  • Fertility and family-building stipend through Carrot.
  • Company-facilitated 401(k).
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Focus: Infrastructure
Seniority: Lead
Stack: CI/CD, Kubernetes, Spark
Location: 1 country accepted
