General Dynamics Mission Systems

Site Reliability Engineer

Remote Site Reliability Engineering role with clearly stated candidate location requirements.

Published: 19 Jun 2026

Eligible countries: 1 country accepted

Seniority signal: Senior

Work model: Remote

Accepted candidate locations

United States

Source last updated: 19 Jun 2026

Stack match: AWS, Azure

Application path: Company site

Role summary

Role content extracted into sections for faster review.

Responsibilities

SLOs and reliability metrics. Define service level objectives for every AI service that goes to production. Establish error budgets and use them to drive engineering decisions, not just to measure uptime.
Monitoring and observability. Build and maintain monitoring, logging, and alerting infrastructure for AI services. You will know when something is degrading before users do.
Incident response. Establish incident management procedures, lead post-incident reviews, and drive corrective actions. When something breaks, you coordinate the response and ensure it doesn't break the same way again.
Operational readiness reviews. Before any AI service goes live, you validate that it meets reliability, security, and operational standards. You are the gate between "it works in dev" and "it's ready for production."
Capacity planning and cost monitoring. Track resource consumption, forecast capacity needs, and monitor costs — tokens, compute, storage. You ensure the platform scales without surprises.
Toil elimination. Identify and automate repetitive operational tasks. If a human is doing something a script could do, you fix that.
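
As an illustration of how an error budget can drive a decision rather than just report uptime, here is a minimal Python sketch. The class name `ErrorBudget`, the function `release_decision`, and the "freeze when the budget is spent" policy are assumptions made for this example; the posting does not specify any tooling or policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the budget still unspent; negative when overspent."""
        allowed = self.allowed_failures
        if allowed == 0:
            # A 100% target (or no traffic) leaves no room for any failure.
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / allowed


def release_decision(budget: ErrorBudget, freeze_below: float = 0.0) -> str:
    # Assumed policy: stop risky releases once the budget is used up.
    return "freeze" if budget.budget_remaining <= freeze_below else "ship"


healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
overspent = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
print(release_decision(healthy), release_decision(overspent))
```

In this sketch a 99.9% target over one million requests allows roughly 1,000 failures, so 250 failures leaves about three quarters of the budget, while 1,200 failures overspends it.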

Out of scope for this role

Application development or AI model building: you ensure that what the development teams build is operable; you do not build it yourself.
Infrastructure provisioning: IT provides the infrastructure; you define what is needed and validate that it works.
Business process decisions or backlog prioritization.

What makes this role different

AI services have failure modes that traditional applications don't: model drift, token budget exhaustion, prompt injection, and upstream data quality degradation. You will build monitoring for problems that most SRE teams have never encountered.
You are applying SRE principles from scratch. There is no existing SRE practice to inherit — you will define it for the platform.
Your operational readiness reviews directly determine whether AI services go live. You have real authority to say "not ready."
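
To make one of the AI-specific failure modes above concrete, the sketch below projects when a token budget would run out, assuming consumption continues at the average rate seen so far. The function names, the linear projection, and the alert rule are all assumptions for illustration; a real service would likely use a windowed or seasonal burn rate.

```python
from typing import Optional


def hours_until_exhaustion(
    budget_tokens: int, used_tokens: int, hours_elapsed: float
) -> Optional[float]:
    """Linear projection of hours left before the token budget is spent.

    Returns 0.0 if the budget is already exhausted and None if nothing
    has been consumed yet (no rate to project from).
    """
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    remaining = budget_tokens - used_tokens
    if remaining <= 0:
        return 0.0
    rate_per_hour = used_tokens / hours_elapsed
    if rate_per_hour == 0:
        return None
    return remaining / rate_per_hour


def should_alert(
    budget_tokens: int,
    used_tokens: int,
    hours_elapsed: float,
    hours_left_in_period: float,
) -> bool:
    # Alert when the projected exhaustion falls before the period ends.
    eta = hours_until_exhaustion(budget_tokens, used_tokens, hours_elapsed)
    return eta is not None and eta < hours_left_in_period


# Hypothetical numbers: 1M-token budget, 600k used after 240 hours,
# with 480 hours left in the billing period.
print(should_alert(1_000_000, 600_000, 240, 480))
```

With these hypothetical numbers the service burns 2,500 tokens per hour, so the remaining 400,000 tokens last 160 hours, well short of the 480 hours left, and the check fires.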

Required qualifications

Bachelor’s degree in Computer Science, Software Engineering, or a related field, plus 5 years of experience; or Master’s degree plus 3 years of experience
Production SRE or DevOps experience — you have owned the reliability of systems that real users depended on, not just built CI/CD pipelines
Hands-on experience with monitoring and observability tools — Prometheus, Grafana, Datadog, ELK, CloudWatch, or similar. You have built dashboards and alerts that caught real problems.
Strong scripting and automation skills — Python, Bash, infrastructure-as-code (Terraform, CloudFormation, or similar)
Experience with containerized environments — Docker, Kubernetes, container orchestration at scale
Experience defining and managing SLOs, error budgets, and incident response procedures in production
U.S. citizenship required. Department of Defense Secret security clearance is required at time of hire.

Preferred qualifications

Experience with AI/ML production systems: model serving, inference monitoring, token cost tracking, or similar
Multi-cloud experience (AWS, Azure, GCP) including cloud-native monitoring and logging services
Experience building operational readiness review processes or production launch checklists
Familiarity with Google SRE principles — you have read the book and applied the concepts, not just referenced them in interviews
Experience in environments where reliability has compliance or safety implications — defense, healthcare, finance, or critical infrastructure

Who you are

You think about failure before you think about features. Your first question about any new system is "how does this break?"
You automate yourself out of toil. If you're doing the same thing twice, you write a script.
You have said "not ready" to a team that wanted to ship, and you were right.
You build monitoring that tells you what's wrong, not just that something is wrong.
You write post-incident reviews that actually change how systems are built, not just how incidents are documented.