Backblaze
Sr. Site Reliability Engineer
Remote Site Reliability Engineering role with clearly defined candidate location eligibility.
Published: Jul 1, 2026
Eligible countries: 1 country accepted
Seniority: Senior
Work model: Remote
Accepted candidate locations:
United States
Requirements and responsibilities
What You'll Do:
- Service Reliability & Operations
  - Own and drive the availability, durability, and performance of critical services across all production environments.
  - Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
  - Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
  - Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
  - Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).
- Automation & Tooling
  - Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
  - Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
  - Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
  - Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.
- Collaboration
  - Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
  - Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
  - Lead capacity planning and disaster recovery strategy across critical infrastructure components.
  - Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
  - Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.
- Continuous Improvement
  - Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
  - Proactively identify systemic, recurring issues and architect and drive the implementation of long-term improvements and strategic design action plans.
  - Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.
Qualifications:
- Education & Experience
  - Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  - 8+ years of progressive experience in site reliability, systems engineering, or operations.
  - Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
- Technical Skills
  - Expert-level Linux systems administration and advanced troubleshooting skills.
  - Experience leading security-minded operations, focusing on system-wide patching, hardening, and proactive vulnerability identification.
  - Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
  - Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
  - Expert knowledge of incident response methodologies and operational best practices.
  - Proven experience designing and operating container orchestration (Kubernetes, Docker); familiarity with microservices concepts is required.
  - Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.
- Preferred Attributes
  - Significant experience in a SaaS, service provider, or hyper-scale distributed systems environment.
  - Deep familiarity with ITIL/OSS practices and experience defining and enforcing SLOs and SLAs.
  - Exceptional problem-solving skills and a strong drive to learn and apply new, complex technologies.
  - Advanced experience with cloud platforms (AWS, GCP, or Azure) in a production setting.
Backblaze Perks:
- Healthcare for family, including dental and vision
- Competitive compensation and 401(k)
- RSU grants for full-time employees
- ESPP program
- Flexible vacation policy
- Maternity & paternity leave
- MacBook Pro to use for work, plus a generous stipend to personalize your workstation
- Childcare bonus (human children only)
- Fertility treatment and support
- Learning & development program
- Commuter benefits
- Culture that supports a healthy work-life balance