Job summary

Sr. Site Reliability Engineer

Requirements and responsibilities

What You'll Do:

Service Reliability & Operations

  • Own and drive the availability, durability, and performance of critical services across all production environments.
  • Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
  • Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
  • Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
  • Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).

Automation & Tooling

  • Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
  • Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
  • Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
  • Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.

Collaboration

  • Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
  • Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
  • Lead capacity planning and disaster recovery strategy across critical infrastructure components.
  • Manage relationships with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLAs.
  • Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.

Continuous Improvement

  • Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
  • Proactively identify systemic, recurring issues, then architect and drive long-term improvements and strategic design action plans.
  • Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.

Qualifications:

Education & Experience

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 8+ years of progressive experience in site reliability, systems engineering, or operations.
  • Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.

Technical Skills

  • Expert-level Linux systems administration and advanced troubleshooting skills.
  • Experience leading security-minded operations, focusing on system-wide patching, hardening, and proactive vulnerability identification.
  • Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
  • Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
  • Expert knowledge of incident response methodologies and operational best practices.
  • Proven experience designing and operating container orchestration (Kubernetes, Docker), along with microservices concepts, is required.
  • Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.

Preferred Attributes

  • Significant experience in a SaaS, service provider, or hyper-scale distributed systems environment.
  • Deep familiarity with ITIL/OSS practices and experience defining and enforcing SLOs and SLAs.
  • Exceptional problem-solving skills and a strong drive to learn and apply new, complex technologies.
  • Advanced experience with cloud platforms (AWS, GCP, or Azure) in a production setting.

Backblaze Perks:

  • Healthcare for family, including dental and vision
  • Competitive compensation and 401(k)
  • RSU grants for full-time employees
  • Employee Stock Purchase Plan (ESPP)
  • Flexible vacation policy
  • Maternity & paternity leave
  • MacBook Pro to use for work, plus a generous stipend to personalize your workstation
  • Childcare bonus (human children only)
  • Fertility treatment and support
  • Learning & development program
  • Commuter benefits
  • Culture that supports a healthy work-life balance