General Dynamics Mission Systems
Site Reliability Engineer
Remote Site Reliability Engineering role with clear candidate location fit.
Published: 19 Jun 2026
Eligible countries: 1 country accepted
Seniority signal: Senior
Work model: Remote
Accepted candidate locations
United States
Role summary
Site Reliability Engineer
Requirements and responsibilities
Role content extracted into sections for faster review.
What You'll Own
- SLOs and reliability metrics. Define service level objectives for every AI service that goes to production. Establish error budgets and use them to drive engineering decisions — not just measure uptime.
- Monitoring and observability. Build and maintain monitoring, logging, and alerting infrastructure for AI services. You will know when something is degrading before users do.
- Incident response. Establish incident management procedures, lead post-incident reviews, and drive corrective actions. When something breaks, you coordinate the response and ensure it doesn't break the same way again.
- Operational readiness reviews. Before any AI service goes live, you validate that it meets reliability, security, and operational standards. You are the gate between "it works in dev" and "it's ready for production."
- Capacity planning and cost monitoring. Track resource consumption, forecast capacity needs, and monitor costs — tokens, compute, storage. You ensure the platform scales without surprises.
- Toil elimination. Identify and automate repetitive operational tasks. If a human is doing something a script could do, you fix that.
What You Won't Own
- Application development or AI model building — you ensure what they build is operable, you don't build it
- Infrastructure provisioning — IT provides the infrastructure; you define what's needed and validate it works
- Business process decisions or backlog prioritization
What Makes This Role Different
- AI services have failure modes that traditional applications don't — model drift, token budget exhaustion, prompt injection, upstream data quality degradation. You will build monitoring for problems that most SRE teams have never encountered.
- You are applying SRE principles from scratch. There is no existing SRE practice to inherit — you will define it for the platform.
- Your operational readiness reviews directly determine whether AI services go live. You have real authority to say "not ready."
Required Qualifications
- Bachelor’s degree in Computer Science, Software Engineering, or a related field, plus 5 years of experience; or Master’s degree plus 3 years of experience
- Production SRE or DevOps experience — you have owned the reliability of systems that real users depended on, not just built CI/CD pipelines
- Hands-on experience with monitoring and observability tools — Prometheus, Grafana, Datadog, ELK, CloudWatch, or similar. You have built dashboards and alerts that caught real problems.
- Strong scripting and automation skills — Python, Bash, infrastructure-as-code (Terraform, CloudFormation, or similar)
- Experience with containerized environments — Docker, Kubernetes, container orchestration at scale
- Experience defining and managing SLOs, error budgets, and incident response procedures in production
- U.S. citizenship required. Department of Defense Secret security clearance is required at time of hire.
Preferred Qualifications
- Experience with AI/ML production systems — model serving, inference monitoring, token cost tracking, or similar
- Multi-cloud experience (AWS, Azure, GCP) including cloud-native monitoring and logging services
- Experience building operational readiness review processes or production launch checklists
- Familiarity with Google SRE principles — you have read the book and applied the concepts, not just referenced them in interviews
- Experience in environments where reliability has compliance or safety implications — defense, healthcare, finance, or critical infrastructure
What Sets You Apart
- You think about failure before you think about features. Your first question about any new system is "how does this break?"
- You automate yourself out of toil. If you're doing the same thing twice, you write a script.
- You have said "not ready" to a team that wanted to ship, and you were right.
- You build monitoring that tells you what's wrong, not just that something is wrong.
- You write post-incident reviews that actually change how systems are built, not just how incidents are documented.
Details
- Remote — 100% telework
- 9/80 schedule
- Defense industry experience is not required