Role summary

Staff Machine Learning Engineer - ML Training Infrastructure

Requirements and responsibilities

What You'll Do:

  • Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale.
  • Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments.
  • Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack.
  • Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery.
  • Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams.
  • Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem.
  • Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team.

Your Skills & Abilities (Required Qualifications)

  • Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience.
  • 7+ years of professional software engineering experience.
  • 5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models.
  • Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems.
  • Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure).
  • Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact.
  • Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost.
  • Willingness to travel to Sunnyvale, CA as needed.
  • Comfortable operating in highly ambiguous and dynamic environments.

What Will Give You a Competitive Edge (Preferred Qualifications)

  • 7+ years of professional software engineering experience.
  • Deep expertise in PyTorch 2.x+ and distributed training frameworks.
  • Experience designing and developing training platforms that support FSDP, pipeline parallelism, and other scalable solutions for training large foundational models.
  • Experience profiling, analyzing, debugging, and optimizing training and data loading performance at scale.
  • Strong record of technical leadership through architecture reviews, roadmap influence, and cross-team execution.
  • Excellent communication skills, with the ability to build consensus, navigate controversial decisions, communicate risks clearly, and provide constructive technical feedback.
  • Self-motivated, execution-oriented, and motivated by delivering broad organizational impact.

Compensation:

  • The salary range for this role is $185,000 to $335,300. The actual base salary offered to a successful candidate within this range will vary based on factors relevant to the position.
  • Bonus Potential: An incentive pay program offers payouts based on company performance, job level, and individual performance.

Benefits:

  • GM offers a variety of health and wellbeing benefit programs. Benefit options include medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation and holidays, tuition assistance programs, employee assistance program, GM vehicle discounts, and more.