Job summary

Site Reliability Engineer

Requirements and responsibilities

You’ll make reliability the default

  • You’ll design and maintain infrastructure that is highly available, fault-tolerant, and scalable
  • You’ll proactively identify and eliminate single points of failure before they become incidents
  • You’ll ensure our production systems remain stable, even under increasing scale and load

You’ll own and optimize our cloud environments

  • You’ll manage and continuously improve workloads across AWS, GCP, or Azure
  • You’ll use Infrastructure as Code (Terraform) to standardize and scale infrastructure
  • You’ll optimize resource usage to balance performance and cost

You’ll run and improve Kubernetes in production

  • You’ll operate and scale Kubernetes clusters (EKS, GKE, etc.) with confidence
  • You’ll troubleshoot issues quickly and ensure smooth deployments and upgrades
  • You’ll ensure our containerized workloads perform reliably at scale

You’ll build monitoring that drives action and lead incident response

  • You’ll implement and refine monitoring systems using tools like Prometheus, Grafana, Datadog, or ELK
  • You’ll define alerting that is meaningful, not noisy
  • You’ll respond to incidents, lead root cause analysis, and ensure we learn from every failure

You’ll automate repetitive operational work

  • You’ll write scripts and build tooling to eliminate repetitive operational work
  • You’ll continuously improve infrastructure efficiency through automation
  • You’ll promote a culture where manual work is a temporary state, not the norm

You’ll collaborate to improve the entire system

  • You’ll work closely with DevOps and engineering teams to solve performance bottlenecks
  • You’ll contribute to CI/CD improvements and deployment reliability
  • You’ll help shape reliability best practices across the organization

First 30 days:

  • You’ve built a strong understanding of our infrastructure, systems, and workflows
  • You’re contributing to day-to-day operations with support from the team
  • You’ve started identifying areas for improvement in automation and reliability

By 90 days:

  • You’re independently managing infrastructure tasks and troubleshooting issues
  • You’re actively contributing to reliability and scalability improvements
  • You’ve taken ownership of parts of our infrastructure and are improving them

Who You Are

  • You’ve spent ~3 years working in SRE, DevOps, or infrastructure engineering, and you’ve seen what breaks at scale
  • You’re comfortable working in cloud environments like AWS, GCP, or Azure, and you understand how distributed systems behave
  • You’ve worked hands-on with Kubernetes in production and know how to troubleshoot it when things go wrong
  • You don’t just fix issues; you ask why they happened and make sure they don’t happen again

Technically, you likely:

  • Use Terraform (or similar IaC tools) to manage infrastructure
  • Work confidently with Docker and Kubernetes
  • Write scripts in Python, Bash, or similar to automate workflows
  • Understand CI/CD pipelines (Jenkins, GitHub Actions, Bitbucket, etc.)
  • Have a solid grasp of networking, load balancing, and high-availability design

When it comes to monitoring:

  • You’ve implemented tools like Prometheus, Grafana, Datadog, or ELK
  • You know the difference between useful alerts and noise
  • You focus on signals that actually drive action

What sets you apart:

  • You take ownership; you don’t wait to be told something is broken
  • You’re calm under pressure and methodical during incidents
  • You simplify complexity instead of adding to it
  • You communicate clearly, even when explaining deeply technical issues
  • You care about building systems that make other engineers more effective

Nice to Have (but not required)

  • Experience with RabbitMQ or Redis in production
  • Familiarity with Ansible or AWX
  • Exposure to multi-cloud or hybrid environments
  • Cloud certifications (AWS, GCP) or Linux certifications
  • Background from ITI (Information Technology Institute)

What the hiring process will look like

  • Screening Interview – Talent Acquisition
  • Technical Interview – SRE Lead
  • Technical Task
  • Final Interview – SRE Lead & Cloud DevOps Director

Focus (job area): Site Reliability Engineer
Seniority signal (candidate level): Senior
Stack (main skills): AWS, Azure, CI/CD
Location (eligibility): 1 country accepted
