HHAeXchange
Platform Engineer
Remote Platform Engineering role with a clear location fit for candidates.
Published: July 2, 2026
Eligible countries: 1 country accepted
Seniority signal: Senior
Work model: Remote
Accepted candidate locations
United States
Job summary
Platform Engineer
Requirements and responsibilities
Essential Job Duties
- Own availability, latency, and performance targets for AI platform services and data infrastructure running on AWS
- Design and implement monitoring, alerting, and observability frameworks across the platform stack
- Lead incident response, root cause analysis, and post-mortem processes for platform-level outages or degradations
- Define and track SLOs/SLAs for core platform primitives including RAG pipelines, agent orchestration services, and model access layers
- Proactively identify reliability risks and drive engineering improvements before they become production issues
- Build and maintain runbooks, disaster recovery procedures, and operational documentation
- Design, build, and maintain CI/CD pipelines for AI platform components, data pipelines, and internal applications
- Own infrastructure-as-code (IaC) practices across the team using tools such as Terraform or AWS CDK
- Manage and optimize AWS environments including ECS, Lambda, S3, RDS, Redshift, API Gateway, and related services
- Implement and enforce security, compliance, and cost optimization best practices across AWS infrastructure
- Automate deployment, scaling, and configuration management to reduce manual operational overhead
- Partner with AI Platform Engineers to containerize and operationalize AI services and agent frameworks
- Support Data & AI Engineers with environment management, access controls, and deployment tooling for Polaris and data pipeline infrastructure
- Serve as the operational backbone for the AI platform team – ensuring engineers can ship and iterate quickly without being blocked by infrastructure concerns
- Contribute to our “factory model” vision by making deployment and reliability a repeatable, scalable capability rather than an ad hoc function
Other Job Duties
- Other duties as assigned by a supervisor or HHAeXchange leader.
Travel Requirements
- Travel up to 10%, including overnight travel
Required Education, Experience, Certifications and Skills
- 3+ years of professional experience in a DevOps, SRE, or platform engineering role
- Hands-on AWS experience required – AgentCore, Bedrock, ECS, Lambda, S3, RDS, Redshift, CloudWatch, IAM, VPC, and related services
- Experience with infrastructure-as-code tools such as Terraform or AWS CDK
- Strong CI/CD experience with tools such as GitHub Actions
- Experience with containerization and orchestration (Docker, ECS, or Kubernetes)
- Familiarity with AI/ML infrastructure patterns – model serving, vector databases, pipeline orchestration (strongly preferred)
- Experience with observability and monitoring tooling (Datadog, CloudWatch)
- Prior experience in a SaaS environment
- Strong verbal and written communication skills, with the ability to collaborate across technical and non-technical stakeholders
- Self-starter with a proactive approach to identifying and resolving infrastructure risk before it impacts delivery
- Willingness to explore and adopt AI tools responsibly to enhance productivity and innovation in your role.
Similar jobs
- Senior QA Automation Engineer, Subway Ecommerce (AWS, CI/CD; 13 countries accepted)
- Senior Backend Engineer (AdTech), Leap Tools (AWS, Kubernetes; 13 countries accepted)
- Senior Backend Engineer, Leap Tools (AWS, Kubernetes; 13 countries accepted)
- Senior Software Engineer, Baltimore Banner (AWS; 13 countries accepted)
Location eligibility
Candidates should apply only if their profile country is listed among the accepted locations.
Hiring process
WithMira displays the job and then sends candidates to the company's own application.
1. Check the job's fit, stack, and location eligibility on WithMira.
2. Open the company's application page through the tracked link.