QAD, Inc.

Sr. Site Reliability Engineer- SRE

Remote Site Reliability Engineering role with a clear candidate-location fit.

Published: 19 Jun 2026

Eligible countries: 1 country accepted

Seniority signal: Senior

Work model: Remote

Accepted candidate locations

Spain

AWS, CI/CD, Kubernetes, Python


Source freshness: 19 Jun 2026

Location fit: 1 country accepted

Stack match: AWS, CI/CD

Application path: Company site


Role summary

Sr. Site Reliability Engineer- SRE

Requirements and responsibilities

What You'll Do:

Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver exceptional customer experience.
Datadog Expert: Be one of the go-to experts for Datadog. You will be responsible for defining, implementing, and enforcing best practices for monitoring, alerting, logging, tracing, and synthetic testing across our entire AWS environment. This includes deep hands-on configuration, dashboarding, troubleshooting, and optimization within Datadog.
Software Development for Reliability: Develop robust, well-tested, and maintainable software and tooling to automate operational tasks, create self-service capabilities for engineering teams, and enhance system reliability. This will involve building applications, not just scripts.
Toil Reduction Champion: Identify and eliminate toil through automation, process improvements, and systematic problem-solving. Work proactively to shift our operational focus from reactive firefighting to proactive engineering.
Incident Management & Post-Mortems: Contribute to and evolve our incident response framework, participating in on-call rotations (using OpsGenie). Lead blameless post-mortems, extracting actionable insights and driving systemic improvements to prevent recurrence.
Reliability Metrics & Goals: Collaborate with engineering teams to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets. Use these metrics to drive continuous improvement and make data-driven decisions about reliability investments.
Infrastructure as Code: Leverage and contribute to our infrastructure as code (IaC) efforts, moving towards a fully automated environment using Terraform and GitHub Actions.
System Design & Architecture: Provide SRE expertise in system design reviews, influencing architectural decisions to build reliability, observability, and scalability into our services from the ground up.
Knowledge Sharing & Mentorship: Document processes, build runbooks, and share your expertise with both the SRE team and broader engineering organization. Help foster an SRE culture of shared ownership and continuous learning.

Core SRE Capabilities

Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role.
Proven ability to rapidly build accurate mental models of complex distributed systems across infrastructure, applications, networking, identity, and observability domains.
Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis.
Experience defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to guide reliability decisions.
Excellent written and verbal communication skills, with the ability to explain complex technical issues clearly to both technical and non-technical audiences.

Experience across several of the following areas:

Kubernetes platforms, including Amazon EKS, and service mesh technologies such as Istio.
Cloud infrastructure and services within AWS.
Identity and access management systems, including Auth0 and AWS IAM.
Networking fundamentals, including DNS, load balancing, routing, TLS, and connectivity troubleshooting.
GitOps workflows and infrastructure automation using tools such as Flux and Terraform.
Observability platforms and practices, including metrics, logs, traces, alerting, dashboards, and synthetic monitoring.
CI/CD systems and engineering workflows.
Application logging and distributed system debugging.

Engineering Mindset

A strong SRE:

Prioritizes service stability and customer impact during incidents.
Slows down under pressure, gathers facts, and communicates clearly.
Reduces operational complexity through automation and simplification.
Identifies and eliminates toil through self-service tooling and process improvement.
Demonstrates strong scripting and automation instincts.
Brings a systems-thinking approach to problem-solving.
Balances short-term remediation with long-term reliability improvements.

Software Engineering for Reliability

Demonstrated ability to build and maintain automation, tooling, and self-service capabilities using one or more programming or scripting languages such as Python, Go, or Bash.
Focuses on applying software engineering practices to improve reliability, reduce toil, and enhance developer productivity.

Behavioral Expectations

Calm and effective during high-severity incidents.
Skilled at managing complex situations involving multiple teams and competing priorities.
Able to lead blameless post-mortems and drive meaningful follow-up actions.
Passionate about continuous improvement and fostering a culture of shared ownership.